geotiff: add read-finalization helpers (PR B of #2162) #2200

Merged: brendancol merged 4 commits into main from issue-2177 on May 20, 2026

@brendancol
Contributor

Summary

Closes #2177. Wave 1 of #2162. Sibling wave 1 PR #2175 is in progress against disjoint files.

Adds two private helpers in xrspatial/geotiff/_attrs.py for the read-finalization pipeline that is duplicated across the four read backends. No call sites are migrated here; that work belongs to waves 2 and 3 (#2178, #2179, #2180). The helpers are dead code until those PRs land.

  • _finalize_eager_read(arr, *, geo_info, nodata, mask_sentinel, mask_nodata, dtype, window, name, ...) validates geo_info, populates attrs, masks sentinel pixels, casts dtype, sets the nodata lifecycle attrs (pixels_present as bool), and returns the xarray.DataArray. mask_sentinel is a separate parameter because the three GPU eager sites derive it three different ways (MinIsWhite inversion, CPU-fallback _mask_nodata, raw nodata).
  • _finalize_lazy_read_attrs(*, geo_info, nodata, mask_nodata, dtype, window, ...) validates geo_info, populates attrs, and sets nodata attrs with pixels_present=None per the documented dask contract from #2135 (geotiff: split overloaded masked_nodata into separate nodata lifecycle signals). Returns the attrs dict only; the caller assembles the dask graph and builds the DataArray.
  • _validate_read_geo_info runs first in both helpers, so partial attrs cannot leak when validation fails.
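The validate-first ordering described above can be sketched roughly as follows. This is an illustration only, not the code in _attrs.py: GeoInfoSketch, validate_geo_info_sketch, and finalize_attrs_sketch are hypothetical stand-ins, the "crs" attrs key is assumed, and the "unparseable CRS" rule is an invented placeholder for whatever _validate_read_geo_info actually checks.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class GeoInfoSketch:
    """Hypothetical stand-in for the real GeoInfo object."""
    crs: Optional[str]
    crs_parseable: bool = True


def validate_geo_info_sketch(geo_info: GeoInfoSketch, *,
                             allow_unparseable_crs: bool = False) -> None:
    # Placeholder rule (an assumption): reject a CRS that could not be
    # parsed unless the caller explicitly opts out.
    if not geo_info.crs_parseable and not allow_unparseable_crs:
        raise ValueError("ambiguous geo_info: CRS could not be parsed")


def finalize_attrs_sketch(attrs_in: Dict[str, Any], *, geo_info: GeoInfoSketch,
                          nodata: Any, pixels_present: Optional[bool] = None,
                          allow_unparseable_crs: bool = False) -> Dict[str, Any]:
    # 1. Validate before touching any attrs, so a failure leaves nothing behind.
    validate_geo_info_sketch(geo_info, allow_unparseable_crs=allow_unparseable_crs)

    # 2. Work on a copy of the seed dict (shallow: nested values are shared).
    attrs = dict(attrs_in)
    attrs["crs"] = geo_info.crs
    if nodata is not None:
        attrs["nodata"] = nodata
        # An eager caller knows the answer and passes a bool; a lazy caller
        # passes None and the key is left out of attrs.
        if pixels_present is not None:
            attrs["nodata_pixels_present"] = bool(pixels_present)
    return attrs


seed = {"sentinel_marker": True}
bad = GeoInfoSketch(crs="???", crs_parseable=False)
try:
    finalize_attrs_sketch(seed, geo_info=bad, nodata=-9999)
    raised = False
except ValueError:
    raised = True

good = GeoInfoSketch(crs="EPSG:4326")
eager_attrs = finalize_attrs_sketch(seed, geo_info=good, nodata=-9999,
                                    pixels_present=True)
lazy_attrs = finalize_attrs_sketch(seed, geo_info=good, nodata=-9999)
```

In this sketch the failed call leaves seed unchanged, the eager-style call records nodata_pixels_present as a bool, and the lazy-style call omits the key.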

Test plan

xrspatial/geotiff/tests/test_finalization_helpers_2162.py synthesises GeoInfo fixtures and exercises the helpers in isolation:

  • nodata, nodata_pixels_present, nodata_dtype_cast, masked_nodata, and georef_status get set correctly on the eager helper for float and int input dtypes.
  • mask_nodata=False skips masking but still surfaces pixels_present.
  • mask_sentinel != nodata (GPU MinIsWhite inversion case) routes through the masking step.
  • Lazy helper produces the same attrs minus nodata_pixels_present.
  • Both helpers raise on ambiguous geo_info before any attrs are written; the caller's seed dict stays untouched.
  • allow_unparseable_crs=True bypasses the validator on both helpers.
  • Signature pinning: eager helper takes mask_sentinel, lazy helper deliberately does not.
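A signature-pinning test of the kind listed above might look like the sketch below. The two functions are hypothetical stand-ins that carry only the parameters named in the summary (the real signatures contain more, elided there as "..."), and the test names are invented for illustration.

```python
import inspect


def eager_helper_sketch(arr, *, geo_info, nodata, mask_sentinel, mask_nodata,
                        dtype, window, name):
    """Hypothetical stand-in for the eager helper; body omitted."""


def lazy_helper_sketch(*, geo_info, nodata, mask_nodata, dtype, window):
    """Hypothetical stand-in for the lazy helper; body omitted."""


def param_names(func):
    # All parameter names, in declaration order.
    return list(inspect.signature(func).parameters)


def keyword_only_names(func):
    # Only the parameters declared after the bare `*`.
    params = inspect.signature(func).parameters.values()
    return [p.name for p in params if p.kind is inspect.Parameter.KEYWORD_ONLY]


def test_eager_takes_mask_sentinel():
    assert "mask_sentinel" in keyword_only_names(eager_helper_sketch)


def test_lazy_omits_mask_sentinel():
    assert "mask_sentinel" not in param_names(lazy_helper_sketch)
```

Pinning the signature this way makes an accidental parameter addition or removal fail in the helper's own test file rather than at a downstream call site.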

21 tests pass locally. Existing _attrs / nodata_lifecycle / georef_status tests still pass.

…ze_lazy_read_attrs (#2177)

Wave 1 of #2162. Add two private helpers in xrspatial/geotiff/_attrs.py
that capture the read-finalization pipelines duplicated across backends.
The helpers are dead code until waves 2 (#2178 dask, #2179 eager) and
3 (#2180 VRT, GPU) consume them.

_finalize_eager_read: validates geo_info, populates attrs, applies the
sentinel mask, casts dtype, sets nodata attrs (with pixels_present as
a bool), returns an xarray.DataArray. mask_sentinel is a parameter
because the three GPU eager sites derive it three different ways
(MinIsWhite inversion, CPU fallback, raw nodata).

_finalize_lazy_read_attrs: validates geo_info, populates attrs, sets
nodata attrs with pixels_present=None per the documented dask contract
from #2135 (a strict per-chunk reduction would force eager .compute()).
Returns the attrs dict only; the caller assembles the dask graph and
builds the DataArray itself.

_validate_read_geo_info runs first in both helpers so partial attrs
do not leak on validation failure.

No public API change. No call sites migrated. Helper signatures are
frozen so wave 2 and 3 can depend on them.
@github-actions bot added the performance label (PR touches performance-sensitive code) on May 20, 2026
Contributor Author

@brendancol brendancol left a comment

PR Review: geotiff: add read-finalization helpers (PR B of #2162)

Wave 1 of #2162. Adds two private helpers in xrspatial/geotiff/_attrs.py plus a 21-test exercise file. No call sites are migrated here. Reviewed against the issue body contract.

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • xrspatial/geotiff/_attrs.py:1276: attrs_in is shallow-copied via dict(attrs_in). If a caller passes a seed dict that contains a list or dict value (e.g. extra_tags) and later mutates that nested value, the change leaks onto the returned DataArray's attrs. Wave 2 will probably pass freshly-built dicts so this isn't likely to bite in practice, but a docstring note ("shallow copy; nested values are shared") would document the gotcha for future migrators.

  • xrspatial/geotiff/_attrs.py:1392-1395: the lazy helper's dtype parameter does double duty. It's the graph dtype (for masked derivation) AND the value recorded as nodata_dtype_cast. The existing dask backend keeps those separate (target_dtype for masked, raw caller dtype= for cast). The docstring now flags this as a wave-2 migration concern, which is the right call given the frozen-signature constraint. Worth pinning a test that documents which interpretation we shipped so wave 2 catches any drift early. The current test_lazy_int_graph_dtype_keeps_masked_false exercises the int-dtype path but doesn't assert the dtype_cast attr; adding assert attrs['nodata_dtype_cast'] == 'int16' to that test would lock in the conflated semantics.
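The shallow-copy behavior raised in the first suggestion above can be reproduced with plain dicts. The keys and values here are made up for illustration; only the dict(...) copy mirrors what the review describes, and copy.deepcopy is shown purely for contrast, not as a recommendation for the helpers.

```python
import copy

# Hypothetical seed dict with one nested (mutable) value.
seed = {"extra_tags": [("tag", 1)], "source": "a.tif"}

shallow = dict(seed)        # top-level copy only; nested list is shared
deep = copy.deepcopy(seed)  # fully independent copy

# The caller mutates a nested value after the copy was taken.
seed["extra_tags"].append(("tag", 2))
# Rebinding a top-level key does not affect either copy.
seed["source"] = "b.tif"

shallow_sees_mutation = len(shallow["extra_tags"]) == 2
deep_sees_mutation = len(deep["extra_tags"]) == 2
```

Here the appended tag is visible through the shallow copy but not through the deep copy, while the rebound "source" key leaves both copies at "a.tif".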

Nits (optional improvements)

  • xrspatial/geotiff/_attrs.py:1267: the _validate_read_geo_info(...) block is identical between the two helpers. Could be extracted into one private call shared by both, but the duplication is small (4 lines x 2) and pulling it out would obscure the validate-first contract. Leave as is.

  • xrspatial/geotiff/_attrs.py:1328: the import xarray as xr inside _finalize_eager_read runs every call. xarray is already imported at module-load time in __init__.py, so a top-level import xarray as xr in _attrs.py would be free. The local import is defensible (matches _validate_dtype_cast's pattern), but xarray is already a hard dep of the package.

  • xrspatial/geotiff/tests/test_finalization_helpers_2162.py:316: the seed-dict-untouched assertion uses seed == {'sentinel_marker': True} which is correct, but a stronger check is 'sentinel_marker' in seed and len(seed) == 1 to catch the case where the validator partially populates and then re-raises (currently impossible but defensive).
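For reference, a small sketch of how the two assertion styles from the last nit behave on an untouched seed and on a hypothetical partially populated one (the "crs" key is invented for the example). Python dict equality compares the full set of keys and values, so in this sketch both checks reject the leaked dict; the explicit length check mostly spells out the intent.

```python
untouched = {"sentinel_marker": True}
# Hypothetical partial leak: the original key survives, a new key is added.
leaked = {"sentinel_marker": True, "crs": "EPSG:4326"}


def equality_check(seed):
    return seed == {"sentinel_marker": True}


def membership_and_length_check(seed):
    return "sentinel_marker" in seed and len(seed) == 1


results = {
    "untouched": (equality_check(untouched), membership_and_length_check(untouched)),
    "leaked": (equality_check(leaked), membership_and_length_check(leaked)),
}
```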

What looks good

  • Validate-first ordering is explicit and tested. The seed-dict-untouched test catches the partial-attrs leak the issue body called out.
  • mask_sentinel != nodata (the GPU MinIsWhite case) is tested directly with a fixture that diverges the two values.
  • Eager helper handles int -> float64 promotion when the sentinel matches, matching the existing inline block field-for-field.
  • The lazy helper's pixels_present=None is documented as the dask contract from #2135 and tested via the absent-from-attrs assertion.
  • Signature pinning tests catch the case where a future refactor adds mask_sentinel to the lazy helper or drops it from the eager helper.
  • Docstrings call out the wave-2 / wave-3 migration plans so the next PR doesn't have to re-derive the design.

Checklist

  • Helper signatures match the issue body's contract.
  • _validate_read_geo_info runs first in both helpers.
  • Partial attrs do not leak on validation failure (tested).
  • Eager helper covers float + int input dtypes.
  • mask_nodata=False opt-out tested.
  • mask_sentinel != nodata (GPU MinIsWhite) tested.
  • Lazy helper omits nodata_pixels_present per #2135.
  • Tests pass locally (21/21).
  • Existing _attrs / nodata_lifecycle / georef_status tests still pass.

- Move ``import xarray as xr`` to module scope. xarray is already a hard
  dependency, so the per-call local import was unnecessary indirection.
- Document the shallow-copy semantics of ``attrs_in`` on both helpers so
  wave 2 / wave 3 migrators know nested values are shared with the
  caller's seed dict.
- Pin the int-dtype ``nodata_dtype_cast`` value in the lazy helper test
  so wave 2 catches any drift from the conflated-dtype semantics.
- Strengthen the seed-dict-untouched assertions with a ``len(seed) == 1``
  check so a future partial-leak that adds new keys is caught even if
  the original key still matches.
Contributor Author

@brendancol brendancol left a comment

PR Review: follow-up pass (PR B of #2162)

Second pass after the initial review findings were addressed. Diff is +54 / -20 across the helper module and the test file.

Blockers

None.

Suggestions

None new. Initial-pass suggestions were applied:

  • attrs_in shallow-copy semantics documented on both helpers (_attrs.py:1265-1276, _attrs.py:1389-1405).
  • test_lazy_int_graph_dtype_keeps_masked_false now asserts nodata_dtype_cast == 'int16' to lock in the wave-1 conflated-dtype contract.

Nits

None new. Initial-pass nits were applied where actionable:

  • import xarray as xr now lives at module scope.
  • Both seed-dict-untouched assertions now also check len(seed) == 1.
  • The dedup nit on the _validate_read_geo_info block was kept as-is (the initial review itself flagged "leave as is" for that one).

What looks good in this pass

  • Diff is minimal and targeted, no scope creep.
  • All 21 helper tests still pass. 99 related tests (_attrs, nodata_lifecycle, georef_status, attrs_contract) also still pass with the module-level xarray import.
  • Docstring notes on attrs_in shallow-copy are short and concrete; they call out the wave-2 / wave-3 caller responsibility without hedging.

Checklist (carry-over from initial pass)

  • Helper signatures match the issue body's contract.
  • _validate_read_geo_info runs first in both helpers.
  • Partial attrs do not leak on validation failure (tested with both equality and length checks).
  • Eager helper covers float + int input dtypes.
  • mask_nodata=False opt-out tested.
  • mask_sentinel != nodata (GPU MinIsWhite) tested.
  • Lazy helper omits nodata_pixels_present per #2135.
  • Conflated dtype semantics on the lazy helper pinned with an explicit assertion.
  • Tests pass locally.
  • Existing tests still pass.

Ready for merge once CI is green.

Labels

performance PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

GeoTIFF: add shared read-finalization helpers (PR B of #2162)

1 participant