Skip to content

rasterize: propagate like.attrs/coords and emit _FillValue (#2018)#2024

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-metadata-rasterize-2026-05-17-01
May 18, 2026
Merged

rasterize: propagate like.attrs/coords and emit _FillValue (#2018)#2024
brendancol merged 2 commits into
mainfrom
deep-sweep-metadata-rasterize-2026-05-17-01

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Closes #2018.

rasterize() was returning a DataArray with empty attrs and freshly-built linspace coords on every code path, even when the caller passed like=template. The result silently broke chained spatial pipelines.

  • slope(rasterize(gdf, like=elevation)) saw no crs, no res, no transform, and recomputed cell size from coords that were not bit-identical to elevation. Same class of bug as the sky_view_factor: horizon angle ignores cell size, wrong by factor of cell_size #1407 sky_view_factor cellsize issue.
  • rasterize(..., fill=-9999, dtype=np.int32) produced an integer raster with no _FillValue and no nodatavals attr, so downstream tools could not identify nodata pixels.

All four backends (numpy, cupy, dask+numpy, dask+cupy) route through the same final xr.DataArray(...) constructor, so the fix is at one site.

Changes

  • _extract_grid_from_like now also returns the template's x/y coords and attrs.
  • The final xr.DataArray(...) constructor reuses like.coords directly when the resolved output grid matches like, so the output is bit-identical to the template and xr.align keeps working.
  • like.attrs are copied onto the output (defensive dict(...) so mutating the output doesn't leak into the template).
  • fill is emitted as attrs['_FillValue'] and attrs['nodatavals'] = (fill,) when fill is not NaN. The default NaN fill leaves attrs empty (no pollution).

Test plan

  • Added TestMetadataPropagation to xrspatial/tests/test_rasterize.py with 9 cases:
    • test_like_propagates_attrs -- crs/transform/res from like land on output
    • test_like_preserves_coords_bit_identical -- output x/y coords are np.array_equal to like.coords
    • test_like_attrs_isolated_from_template -- mutating output attrs does not mutate like.attrs
    • test_fill_value_recorded_when_not_nan -- fill=-9999 -> _FillValue == -9999, nodatavals == (-9999,)
    • test_fill_value_omitted_for_nan -- default fill leaves _FillValue and nodatavals absent
    • test_no_like_no_attrs_pollution -- no like, NaN fill -> attrs == {}
    • test_like_attrs_propagated_dask -- dask+numpy backend behaves identically
    • test_like_attrs_propagated_cupy -- cupy backend behaves identically (skipped if no CUDA)
    • test_like_attrs_propagated_dask_cupy -- dask+cupy backend behaves identically (skipped if no CUDA)
  • Full rasterize suite passes: 193 passed, 2 skipped.
  • All 9 new tests pass, including across the four backends (numpy, cupy, dask+numpy, dask+cupy).

When the caller passed ``like=template``, ``rasterize()`` dropped all of
the template's ``attrs`` (including ``crs``, ``res``, ``transform``,
``nodatavals``) and rebuilt the output coords via ``np.linspace`` from
re-derived bounds.  The output then no longer ``xr.align``-ed with the
template, and chained pipelines like
``slope(rasterize(gdf, like=elevation))`` silently saw a no-CRS, no-res
raster and recomputed cell-size from coords -- the same class of bug as
the #1407 sky_view_factor cellsize issue.

Three fixes, all at the same site (line 2227, where the output
``xr.DataArray`` is built):

1. ``_extract_grid_from_like`` now also returns the input ``x`` and
   ``y`` coords and ``attrs``.  ``rasterize()`` copies ``like.attrs``
   onto the output.

2. When the resolved output grid matches ``like`` (same width, height,
   and bounds), the output reuses ``like.coords['x']`` and
   ``like.coords['y']`` directly so the result is bit-identical to the
   template and ``xr.align`` keeps working.

3. The ``fill`` value is now emitted as ``attrs['_FillValue']`` and
   ``attrs['nodatavals']`` when ``fill`` is not NaN.  Downstream tools
   (``to_geotiff``, custom masks) can identify nodata pixels.

All four backends (numpy, cupy, dask+numpy, dask+cupy) route through
the same final ``xr.DataArray`` constructor, so the fix is in one
place and behaves identically across backends.

Adds ``TestMetadataPropagation`` to ``test_rasterize.py`` with 9 cases
covering attrs propagation, bit-identical coord reuse, fill-value
emission, isolation from the template's attrs dict, and parity across
all four backends.

Closes #2018.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 18, 2026
@brendancol
Copy link
Copy Markdown
Contributor Author

PR Review: rasterize: propagate like.attrs/coords and emit _FillValue (#2018)

Nice fix — the bug is real and the one-site fix at the final xr.DataArray constructor is the right call. I found one blocker around conflicting nodata attrs carried over from like, a few suggestions around NaN detection and coord coverage, plus minor nits.

Blockers (must fix before merge)

  • xrspatial/rasterize.py:2255-2259 — when fill defaults to NaN and like.attrs already carries _FillValue, nodatavals, or nodata from a previous round-trip, those stale sentinels are copied straight through and then NOT overwritten (because fill_is_nan skips the emit branch). Reproduces with:
    like = xr.DataArray(..., attrs={'_FillValue': -9999, 'nodatavals': (-9999,), 'crs': 'EPSG:4326'})
    out = rasterize([(box(2,2,8,8), 1.0)], like=like)  # fill defaults to NaN
    # out.attrs['_FillValue'] == -9999, but actual unwritten pixels are NaN.
    The geotiff writer's _resolve_nodata_attr (xrspatial/geotiff/_attrs.py:640) checks nodata first, then nodatavals, then _FillValue, so a to_geotiff(out) round-trip would tag pixels with the wrong sentinel. Strip any inherited _FillValue / nodatavals / nodata from out_attrs before deciding whether to emit a fresh pair based on fill. Same logic should apply on the non-NaN branch — overwrite nodata too, not just _FillValue and nodatavals, otherwise an inherited nodata=-9999 outranks the freshly-set _FillValue=0.

Suggestions (should fix, not blocking)

  • xrspatial/rasterize.py:2256isinstance(fill, float) and np.isnan(fill) misses numpy scalar floats. rasterize(..., fill=np.float32(np.nan)) currently emits _FillValue=nan even though semantically NaN. Use a try/except wrapper or np.isnan(np.asarray(fill, dtype=float)) so any numeric NaN is treated the same.

  • xrspatial/rasterize.py:1948-1949,1965-1967_extract_grid_from_like only returns like.coords['x'] and like.coords['y']. A like produced by rioxarray typically also carries non-dimension coords like spatial_ref (the CRS coord) and any band/time coords. These get dropped on the output, so rio.write_crs() followed by rasterize loses the CRS coord even though attrs['crs'] survives. Propagate all of like.coords (or at minimum any scalar non-dim coords that don't conflict with the new y/x sizes).

  • xrspatial/rasterize.py:2236-2240 — the bit-identical reuse predicate checks final_bounds == like_bounds with float equality. After _extract_grid_from_like derives like_bounds via float(np.min(x)) - px/2, repeated float arithmetic can produce bounds that differ by 1 ULP from a caller-supplied identical-looking tuple. Either compare with np.allclose(...) or just key the reuse on width == like_width and height == like_height and bounds is None (caller didn't override bounds → we know final_bounds is like_bounds).

  • xrspatial/tests/test_rasterize.py:1681-1683test_fill_value_recorded_when_not_nan passes fill=-9999 as a Python int and asserts _FillValue == -9999. Add an assertion that the stored sentinel's type is compatible with final_dtype (or at least round-trips correctly through to_geotiff). The geotiff path coerces to float, so this is mostly latent, but the dtype mismatch is worth pinning.

  • xrspatial/tests/test_rasterize.py — no test for the round-trip described in the PR body: rasterize → to_geotiff → read back → nodata matches. That's the use case rasterize: drop attrs, recompute coords, and don't emit _FillValue when like= is given #2018 calls out. A single test that writes to a tmp_path, reads back, and asserts the file's nodata sentinel matches fill would lock in the user-visible fix.

  • xrspatial/tests/test_rasterize.py — no test covering the inherited-nodata blocker above. Add one for like.attrs={'_FillValue': -9999} with fill=np.nan and assert _FillValue is NOT in result.attrs (or is NaN).

Nits (optional improvements)

  • xrspatial/rasterize.py:1965-1967like.coords['x'].copy() and like.coords['y'].copy() defensively copy, but xr.DataArray(..., coords={'x': ...}) already wraps the input in a new IndexVariable. The copy is redundant for the reuse branch (and it defeats the bit-identical claim slightly, since .copy() allocates new buffers — still equal under array_equal, but a wasted allocation on every like= call).

  • xrspatial/rasterize.py:2255dict(like_attrs) if like_attrs is not None else {} is fine, but _extract_grid_from_like already does dict(like.attrs) on line 1967. The second dict(...) is redundant since like_attrs is already a fresh dict. Pick one site for the defensive copy.

  • xrspatial/tests/test_rasterize.py:1695-1697 — the comment "When the caller passed like, the output should reuse its coords exactly (not rebuild via linspace)" reads slightly long. Trim to one line.

  • xrspatial/rasterize.py:2252-2253 — comment says "for xrspatial internal compatibility" but nodatavals is rioxarray's convention, not xrspatial's. The actual xrspatial-native key is nodata. Either emit nodata (matching _resolve_nodata_attr's primary key) or reword the comment to say "for rioxarray compatibility".

What looks good

  • Single fix site for all four backends — they really do all route through the final xr.DataArray(...) constructor, so the parity claim holds.
  • Defensive dict(like.attrs) so output mutation doesn't leak into the template — caught a real footgun.
  • The like.coords['x'].sizes['x'] == final_width size gate handles the case where the caller overrides width/height/bounds while still passing like (you fall back to fresh linspace).
  • 9 tests is good coverage for what's there; the cross-backend trio (numpy/dask/cupy/dask+cupy) catches the silent-drop case the metadata sweep flagged.

Checklist

  • Algorithm matches reference/paper (N/A — metadata propagation, not an algorithm change)
  • All implemented backends produce consistent results
  • NaN handling is correct (modulo the float32 scalar suggestion)
  • Edge cases are covered by tests (inherited _FillValue from like missing)
  • Dask chunk boundaries handled correctly (same final constructor)
  • No premature materialization or unnecessary copies (modulo the redundant .copy() nit)
  • Benchmark exists or is not needed (metadata-only change)
  • README feature matrix updated (N/A — no new function)
  • Docstrings present and accurate

Strip inherited nodata / _FillValue / nodatavals from like.attrs before
emitting a fresh triplet keyed off the actual fill, so stale sentinels
from a prior round-trip can't outlive the new fill. Detect numpy-typed
NaN via float() + np.isnan() so np.float32(np.nan) is treated as NaN.
Propagate non-dim coords (e.g. rioxarray's spatial_ref). Replace the
float-equal bounds check in the coord-reuse predicate with a check on
"caller didn't override bounds/resolution". Adds tests for inherited
nodata, numpy-scalar NaN, spatial_ref propagation, and a geotiff
round-trip pinning the user-visible #2018 fix.
@brendancol brendancol merged commit edec07c into main May 18, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rasterize: drop attrs, recompute coords, and don't emit _FillValue when like= is given

1 participant