geotiff: add read-finalization helpers (PR B of #2162)#2200
Conversation
…ze_lazy_read_attrs (#2177) Wave 1 of #2162. Add two private helpers in xrspatial/geotiff/_attrs.py that capture the read-finalization pipelines duplicated across backends. The helpers are dead code until waves 2 (#2178 dask, #2179 eager) and 3 (#2180 VRT, GPU) consume them. _finalize_eager_read: validates geo_info, populates attrs, applies the sentinel mask, casts dtype, sets nodata attrs (with pixels_present as a bool), returns an xarray.DataArray. mask_sentinel is a parameter because the three GPU eager sites derive it three different ways (MinIsWhite inversion, CPU fallback, raw nodata). _finalize_lazy_read_attrs: validates geo_info, populates attrs, sets nodata attrs with pixels_present=None per the documented dask contract from #2135 (a strict per-chunk reduction would force eager .compute()). Returns the attrs dict only; the caller assembles the dask graph and builds the DataArray itself. _validate_read_geo_info runs first in both helpers so partial attrs do not leak on validation failure. No public API change. No call sites migrated. Helper signatures are frozen so wave 2 and 3 can depend on them.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: geotiff: add read-finalization helpers (PR B of #2162)
Wave 1 of #2162. Adds two private helpers in xrspatial/geotiff/_attrs.py plus a 21-test exercise file. No call sites are migrated here. Reviewed against the issue body contract.
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
-
xrspatial/geotiff/_attrs.py:1276:attrs_inis shallow-copied viadict(attrs_in). If a caller passes a seed dict that contains a list or dict value (e.g.extra_tags) and later mutates that nested value, the change leaks onto the returned DataArray's attrs. Wave 2 will probably pass freshly-built dicts so this isn't likely to bite in practice, but a docstring note ("shallow copy; nested values are shared") would document the gotcha for future migrators. -
xrspatial/geotiff/_attrs.py:1392-1395: the lazy helper'sdtypeparameter does double duty. It's the graph dtype (formaskedderivation) AND the value recorded asnodata_dtype_cast. The existing dask backend keeps those separate (target_dtypefor masked, raw callerdtype=for cast). The docstring now flags this as a wave-2 migration concern, which is the right call given the frozen-signature constraint. Worth pinning a test that documents which interpretation we shipped so wave 2 catches any drift early. The currenttest_lazy_int_graph_dtype_keeps_masked_falseexercises the int-dtype path but doesn't assert the dtype_cast attr; addingassert attrs['nodata_dtype_cast'] == 'int16'to that test would lock in the conflated semantics.
Nits (optional improvements)
-
xrspatial/geotiff/_attrs.py:1267: the_validate_read_geo_info(...)block is identical between the two helpers. Could be extracted into one private call shared by both, but the duplication is small (4 lines x 2) and pulling it out would obscure the validate-first contract. Leave as is. -
xrspatial/geotiff/_attrs.py:1328: theimport xarray as xrinside_finalize_eager_readruns every call.xarrayis already imported at module-load time in__init__.py, so a top-levelimport xarray as xrin_attrs.pywould be free. The local import is defensible (matches_validate_dtype_cast's pattern), but xarray is already a hard dep of the package. -
xrspatial/geotiff/tests/test_finalization_helpers_2162.py:316: the seed-dict-untouched assertion usesseed == {'sentinel_marker': True}which is correct, but a stronger check is'sentinel_marker' in seed and len(seed) == 1to catch the case where the validator partially populates and then re-raises (currently impossible but defensive).
What looks good
- Validate-first ordering is explicit and tested. The seed-dict-untouched test catches the partial-attrs leak the issue body called out.
mask_sentinel != nodata(the GPU MinIsWhite case) is tested directly with a fixture that diverges the two values.- Eager helper handles int -> float64 promotion when the sentinel matches, matching the existing inline block field-for-field.
- The lazy helper's
pixels_present=Noneis documented as the dask contract from #2135 and tested via the absent-from-attrs assertion. - Signature pinning tests catch the case where a future refactor adds
mask_sentinelto the lazy helper or drops it from the eager helper. - Docstrings call out the wave-2 / wave-3 migration plans so the next PR doesn't have to re-derive the design.
Checklist
- Helper signatures match the issue body's contract.
-
_validate_read_geo_inforuns first in both helpers. - Partial attrs do not leak on validation failure (tested).
- Eager helper covers float + int input dtypes.
-
mask_nodata=Falseopt-out tested. -
mask_sentinel != nodata(GPU MinIsWhite) tested. - Lazy helper omits
nodata_pixels_presentper #2135. - Tests pass locally (21/21).
- Existing
_attrs/nodata_lifecycle/georef_statustests still pass.
- Move ``import xarray as xr`` to module scope. xarray is already a hard dependency, so the per-call local import was unnecessary indirection. - Document the shallow-copy semantics of ``attrs_in`` on both helpers so wave 2 / wave 3 migrators know nested values are shared with the caller's seed dict. - Pin the int-dtype ``nodata_dtype_cast`` value in the lazy helper test so wave 2 catches any drift from the conflated-dtype semantics. - Strengthen the seed-dict-untouched assertions with a ``len(seed) == 1`` check so a future partial-leak that adds new keys is caught even if the original key still matches.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: follow-up pass (PR B of #2162)
Second pass after the initial review findings were addressed. Diff is +54 / -20 across the helper module and the test file.
Blockers
None.
Suggestions
None new. Initial-pass suggestions were applied:
attrs_inshallow-copy semantics documented on both helpers (_attrs.py:1265-1276,_attrs.py:1389-1405).test_lazy_int_graph_dtype_keeps_masked_falsenow assertsnodata_dtype_cast == 'int16'to lock in the wave-1 conflated-dtype contract.
Nits
None new. Initial-pass nits were applied where actionable:
import xarray as xrnow lives at module scope.- Both seed-dict-untouched assertions now also check
len(seed) == 1. - The dedup nit on the
_validate_read_geo_infoblock was kept as-is (the initial review itself flagged "leave as is" for that one).
What looks good in this pass
- Diff is minimal and targeted, no scope creep.
- All 21 helper tests still pass. 99 related tests (
_attrs,nodata_lifecycle,georef_status,attrs_contract) also still pass with the module-level xarray import. - Docstring notes on
attrs_inshallow-copy are short and concrete; they call out the wave-2 / wave-3 caller responsibility without hedging.
Checklist (carry-over from initial pass)
- Helper signatures match the issue body's contract.
-
_validate_read_geo_inforuns first in both helpers. - Partial attrs do not leak on validation failure (tested with both equality and length checks).
- Eager helper covers float + int input dtypes.
-
mask_nodata=Falseopt-out tested. -
mask_sentinel != nodata(GPU MinIsWhite) tested. - Lazy helper omits
nodata_pixels_presentper #2135. - Conflated
dtypesemantics on the lazy helper pinned with an explicit assertion. - Tests pass locally.
- Existing tests still pass.
Ready for merge once CI is green.
Summary
Closes #2177. Wave 1 of #2162. Sibling wave 1 PR #2175 is in progress against disjoint files.
Adds two private helpers in
xrspatial/geotiff/_attrs.pyfor the read-finalization pipeline that is duplicated across the four read backends. No call sites are migrated here; that work belongs to waves 2 and 3 (#2178, #2179, #2180). The helpers are dead code until those PRs land._finalize_eager_read(arr, *, geo_info, nodata, mask_sentinel, mask_nodata, dtype, window, name, ...)validatesgeo_info, populates attrs, masks sentinel pixels, casts dtype, sets the nodata lifecycle attrs (pixels_presentas bool), and returns thexarray.DataArray.mask_sentinelis a separate parameter because the three GPU eager sites derive it three different ways (MinIsWhite inversion, CPU-fallback_mask_nodata, rawnodata)._finalize_lazy_read_attrs(*, geo_info, nodata, mask_nodata, dtype, window, ...)validatesgeo_info, populates attrs, sets nodata attrs withpixels_present=Noneper the documented dask contract from geotiff: split overloaded masked_nodata into separate nodata lifecycle signals #2135. Returns the attrs dict only; the caller assembles the dask graph and builds the DataArray._validate_read_geo_inforuns first in both helpers, so partial attrs cannot leak when validation fails.Test plan
xrspatial/geotiff/tests/test_finalization_helpers_2162.pysynthesisesGeoInfofixtures and exercises the helpers in isolation:nodata,nodata_pixels_present,nodata_dtype_cast,masked_nodata, andgeoref_statusget set correctly on the eager helper for float and int input dtypes.mask_nodata=Falseskips masking but still surfacespixels_present.mask_sentinel != nodata(GPU MinIsWhite inversion case) routes through the masking step.nodata_pixels_present.allow_unparseable_crs=Truebypasses the validator on both helpers.mask_sentinel, lazy helper deliberately does not.21 tests pass locally. Existing
_attrs/nodata_lifecycle/georef_statustests still pass.