
Use CUDA atomics in GPU rasterize kernels (#2167) #2198

Merged
brendancol merged 5 commits into main from issue-2167
May 21, 2026
Conversation

@brendancol
Contributor

Summary

  • The GPU point and line rasterize kernels used non-atomic
    read-modify-write on per-pixel state, so overlapping geometries
    returned values that disagreed with the numpy backend and varied
    between runs. Switch them to CUDA atomics for the six built-in
    aggregators (last, first, sum, count, min, max).
  • first / last use a two-pass scheme: pass 1 resolves the winning
    input index with atomic min/max on an index buffer; pass 2 stamps
    the value in.
  • Add regression tests covering coincident points, crossing line
    segments, and duplicate segments, with both cross-backend parity
    and per-run determinism checks.

Backends touched: cupy, dask+cupy. numpy and dask+numpy paths are
unchanged.

Test plan

  • New tests in xrspatial/tests/test_rasterize_gpu_race_2167.py
    pass on the local GPU.
  • Existing rasterize tests still pass on the local GPU.

Closes #2167

The GPU point and line burn kernels did non-atomic read-modify-write
on per-pixel state, so overlapping geometries produced nondeterministic
results that disagreed with the numpy backend.

Switch the per-pixel write strategy to CUDA atomics for the six
built-in aggregators:

- sum, count: cuda.atomic.add on the output buffer.
- min, max: cuda.atomic.min / cuda.atomic.max on the output buffer.
- first, last: two-pass. Pass 1 resolves the per-pixel winning
  input index with cuda.atomic.min / cuda.atomic.max on an index
  buffer; pass 2 stamps the winner's value into the output.
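
The sum, count, min, and max strategies above can be illustrated on the CPU. This is a minimal sketch, not the PR's kernel code: it uses NumPy's unbuffered `ufunc.at` as a stand-in for `cuda.atomic.add`, `cuda.atomic.min`, and `cuda.atomic.max`, and the function name `burn_reference`, the flattened pixel indexing, and the float64 buffers are assumptions made for illustration.

```python
import numpy as np


def burn_reference(pixels, values, n_pixels, merge):
    """CPU stand-in for the atomic sum/count/min/max burns (hypothetical helper).

    pixels[i] is the flattened output index hit by input i, and several
    inputs may hit the same pixel. ufunc.at applies every update, which
    is the property the atomics restore on the GPU.
    """
    pixels = np.asarray(pixels, dtype=np.intp)
    values = np.asarray(values, dtype=np.float64)
    if merge == "sum":
        out = np.zeros(n_pixels)            # identity for addition
        np.add.at(out, pixels, values)
    elif merge == "count":
        out = np.zeros(n_pixels)
        np.add.at(out, pixels, 1.0)         # one increment per hit
    elif merge == "min":
        out = np.full(n_pixels, np.inf)     # identity for min
        np.minimum.at(out, pixels, values)
    elif merge == "max":
        out = np.full(n_pixels, -np.inf)    # identity for max
        np.maximum.at(out, pixels, values)
    else:
        raise ValueError(f"unsupported merge: {merge!r}")
    return out
```

For count, min, and max the result does not depend on the order in which the updates land. For sum the same holds up to floating-point rounding, so exact equality across orderings is only guaranteed when the partial sums are exactly representable.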

The per-merge buffers are initialised up front (zero / +inf / -inf /
fill) and a host-side post-pass blends fill back into untouched
pixels.
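
The two-pass first / last scheme, together with the up-front initialisation and the fill post-pass, can be sketched the same way. This is a CPU illustration under stated assumptions, not the code in `xrspatial/rasterize.py`: `burn_first_last`, the sentinel values, and the int64 index buffer are hypothetical, and `np.minimum.at` / `np.maximum.at` stand in for `cuda.atomic.min` / `cuda.atomic.max`.

```python
import numpy as np


def burn_first_last(pixels, values, n_pixels, merge, fill=np.nan):
    """CPU sketch of the two-pass first/last burn (hypothetical helper)."""
    pixels = np.asarray(pixels, dtype=np.intp)
    values = np.asarray(values, dtype=np.float64)
    order = np.arange(pixels.size, dtype=np.int64)  # input index of each write

    # Pass 1: resolve the winning input index per pixel.
    if merge == "first":
        sentinel = np.iinfo(np.int64).max   # larger than any real index
        winner = np.full(n_pixels, sentinel, dtype=np.int64)
        np.minimum.at(winner, pixels, order)
    elif merge == "last":
        sentinel = -1                       # smaller than any real index
        winner = np.full(n_pixels, sentinel, dtype=np.int64)
        np.maximum.at(winner, pixels, order)
    else:
        raise ValueError(f"unsupported merge: {merge!r}")

    # Pass 2: only the winning input stamps its value, so each pixel
    # has exactly one writer and no atomics are needed here.
    out = np.zeros(n_pixels, dtype=np.float64)
    is_winner = winner[pixels] == order
    out[pixels[is_winner]] = values[is_winner]

    # Post-pass: blend fill back into pixels that no input touched.
    touched = winner != sentinel
    return np.where(touched, out, fill)
```

Because pass 1 reduces over input indices rather than values, the winner is fixed by input order alone, which is what makes the result independent of thread scheduling.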

User-supplied merge callables keep the previous non-atomic closure
path, since the public merge_fn signature has no atomic equivalent.

The numpy and dask+numpy backends are unchanged.

Cover the six built-in aggregators (last, first, sum, count, min,
max) across three overlap scenarios: coincident points, crossing
line segments, and duplicate segments. Assert that the cupy output
matches the numpy output exactly and that repeated cupy runs
produce identical arrays.
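
The two assertions described above (cross-backend parity and per-run determinism) could be factored into one helper. The sketch below is an assumption about shape only, not the contents of `test_rasterize_gpu_race_2167.py`: `assert_parity_and_determinism`, `run_gpu`, and `run_cpu` are hypothetical names, and both callables are assumed to return NumPy arrays (a cupy result would need to be copied to the host first).

```python
import numpy as np


def assert_parity_and_determinism(run_gpu, run_cpu, n_runs=5):
    """Check one overlap scenario (hypothetical helper).

    run_gpu / run_cpu are zero-argument callables returning NumPy arrays.
    assert_array_equal treats NaNs in matching positions as equal, so
    NaN fill values do not cause false failures.
    """
    expected = np.asarray(run_cpu())
    first = np.asarray(run_gpu())
    # Parity: the GPU result matches the reference backend exactly.
    np.testing.assert_array_equal(first, expected)
    # Determinism: every repeated GPU run matches the first one.
    for _ in range(n_runs - 1):
        np.testing.assert_array_equal(np.asarray(run_gpu()), first)
```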
github-actions bot added the performance label (PR touches performance-sensitive code) on May 20, 2026
Contributor Author

@brendancol brendancol left a comment


PR Review: Use CUDA atomics in GPU rasterize kernels (#2167)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • xrspatial/rasterize.py:1569-1577 -- In _run_cupy the boundary-burn block uses poly_geoms and poly_ids from the polygon-fill block above. The boundary path runs whenever all_touched and poly_geoms is truthy, and poly_geoms comes from _classify_geometries earlier, so it's defined even if no edges crossed the raster. The same shape appears in _rasterize_tile_cupy:2008-2031, where poly_geoms is bound inside if poly_wkb: and only used when poly_wkb and all_touched. Safe today, fragile to future refactors. Worth nesting the boundary block inside the polygon block, or adding a comment, so the dependency is explicit.
  • xrspatial/rasterize.py:1083-1109 -- _apply_merge_gpu now branches on atomic_mode at compile time, so each merge produces a separate compiled kernel. Fine. But written is passed into every kernel and only read in the legacy user-callable branch. For atomic modes it's an (H, W) int8 allocation that the kernels never touch. Worth skipping the written allocation for atomic modes. Memory is small but it's a clean win.

Nits (optional improvements)

  • xrspatial/rasterize.py:1079-1080 -- The comment block says `pass_id` selects which branch executes (kernels are compiled once per pass). The implementation has no pass_id parameter; pass 1 and pass 2 are separate kernels named ..._gpu and ..._gpu_pass2. Update the comment to match the code.
  • xrspatial/tests/test_rasterize_gpu_race_2167.py -- An all_touched=True scenario with overlapping polygon boundaries would be a useful addition. The cross-backend parity test would already catch a regression there, but an explicit scenario makes the coverage clearer for the next reader.

What looks good

  • The atomic strategy follows the fix sketched in issue #2167: atomic add for sum/count, atomic min/max on the value buffer for min/max, two-pass index-resolution for first/last.
  • The numpy and dask+numpy paths are untouched.
  • The kernel cache key now includes merge_name, so atomic and non-atomic kernels coexist without collision.
  • Tests cover all six aggregators across three overlap scenarios, with both cross-backend parity and per-run determinism.
  • The legacy non-atomic closure path is kept for user callables, which is the right call given the public merge_fn signature.
  • _gpu_finalize_buffers handles the +inf / -inf initialisers for min/max so the user-visible fill semantics stay consistent with numpy.
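
To illustrate the cache-key point above, here is a minimal sketch of a kernel cache keyed by merge name. It is an assumption about the general pattern, not the code in `_ensure_gpu_kernels`: `get_kernels`, `_KERNEL_CACHE`, the key layout, and the `build` callback are all hypothetical.

```python
_KERNEL_CACHE = {}  # hypothetical module-level cache


def get_kernels(merge_name, build):
    """Return the kernels for merge_name, compiling them at most once.

    build is a callable that compiles and returns the kernels for one
    merge. Keying on merge_name keeps the specialised kernels for each
    merge in separate cache slots.
    """
    key = ("rasterize", merge_name)
    if key not in _KERNEL_CACHE:
        _KERNEL_CACHE[key] = build(merge_name)
    return _KERNEL_CACHE[key]
```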

Checklist

  • Algorithm matches the strategy described in issue #2167.
  • All six built-in aggregators produce results consistent with the numpy backend (verified by tests).
  • NaN / fill handling is correct (finalize blends fill into untouched pixels).
  • Edge cases are covered (coincident points, crossing lines, duplicate segments).
  • [n/a] Dask chunk boundaries: tiles partition the output grid, so per-tile atomicity is sufficient. dask+cupy uses the same kernel set via _rasterize_tile_cupy.
  • No premature materialization. Device transfers are staged once and reused for pass 2.
  • [n/a] Benchmark: this is a correctness fix, not a perf change.
  • [n/a] README feature matrix: no public API changed.
  • Docstrings updated (_ensure_gpu_kernels and _run_cupy describe the atomic path).

- Skip the (H, W) ``written`` allocation when the atomic kernel path
  is selected. Atomic modes never read or write ``written``; a (1, 1)
  placeholder satisfies the kernel signature without spending H*W
  bytes on dead storage.
- Nest the all_touched boundary launch inside the ``if poly_wkb:``
  block in ``_rasterize_tile_cupy`` so the dependency on
  ``poly_geoms`` / ``poly_ids`` is local. Add a parallel comment to
  ``_run_cupy`` explaining why the same shape is safe there
  (``poly_geoms`` comes from ``_classify_geometries`` unconditionally).
- Fix the stale ``pass_id`` comment in ``_ensure_gpu_kernels``; pass 1
  and pass 2 are separate compiled kernels, not branches selected by a
  runtime ``pass_id``.
- Add a shared-boundary polygon scenario (cross-backend parity plus
  per-run determinism) to the regression tests. This is the polygon
  analogue of the line-overlap case and exercises the all_touched
  scanline-plus-boundary write pattern.
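
The `written` placeholder from the first item can be sketched as follows. This is an illustration under assumptions, not the PR's code: `alloc_written` and `ATOMIC_MERGES` are hypothetical names, and NumPy is the default array module here so the sketch runs without a GPU (passing `xp=cupy` would allocate on the device instead).

```python
import numpy as np

# The six built-in aggregators that take the atomic kernel path.
ATOMIC_MERGES = frozenset({"last", "first", "sum", "count", "min", "max"})


def alloc_written(shape, merge_name, xp=np):
    """Allocate the int8 `written` buffer (hypothetical helper).

    Atomic modes never read or write `written`, so a (1, 1) placeholder
    is enough to satisfy the kernel signature. Any other merge (a user
    callable) still gets the full (H, W) buffer.
    """
    if merge_name in ATOMIC_MERGES:
        return xp.zeros((1, 1), dtype=xp.int8)
    return xp.zeros(shape, dtype=xp.int8)
```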
Contributor Author

@brendancol brendancol left a comment


Follow-up review (after 2842709)

All four findings from the prior review are addressed.

Disposition

  • Suggestion 1 (fragile poly_geoms dependency in boundary block) -- fixed. _rasterize_tile_cupy now stages the boundary launch inside the if poly_wkb: block so the dependency is local. _run_cupy got an explicit comment noting that poly_geoms comes from _classify_geometries unconditionally and the same shape is safe there.
  • Suggestion 2 (written allocation wasted for atomic modes) -- fixed. Atomic modes now get a (1, 1) int8 placeholder; the kernels never touch it for those modes, and the legacy user-callable path still gets a full (H, W) allocation.
  • Nit 1 (stale pass_id comment) -- fixed. Comment now describes the actual two-kernel layout.
  • Nit 2 (no all_touched=True polygon scenario) -- fixed. Added shared-boundary polygon tests covering cross-backend parity and per-run determinism for all six aggregators.

Verification

  • 50 tests pass in xrspatial/tests/test_rasterize_gpu_race_2167.py (up from 38 in the first pass).
  • 325 tests pass across the full rasterize suite (test_rasterize.py, test_rasterize_accuracy.py, test_rasterize_coverage_2026_05_17.py, test_rasterize_tile_props_slice_2020.py, test_rasterize_gpu_race_2167.py).

@brendancol
Contributor Author

Heads up on CI: the three failing run (..., 3.14) jobs are failing on test_polygonize_dask_multi_chunk_default_tolerance and test_polygonize_dask_multi_chunk_strict_float in xrspatial/tests/test_polygonize.py (assert 2 == 3).

These two failures are pre-existing on main, not introduced by the atomics work in this PR. The same two failures show up on the unrelated PR #2196 against the same main head. Recent main runs confirm this: the polygonize tests have been red there independently.

Everything in this PR's surface area (the new GPU rasterize race tests, the existing rasterize suite) passes locally on the merged branch. Happy to address the polygonize failures in a separate change if useful, but they did not regress here.

# Conflicts:
#	xrspatial/rasterize.py
@brendancol brendancol merged commit bd6583a into main May 21, 2026
4 of 5 checks passed

Labels

performance PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

GPU point/line rasterize kernels have racy overlap writes

1 participant