Refactor GeoTIFF Read Finalization and Dispatch Validation
Summary
Refactor the GeoTIFF read stack so shared policy lives in shared helpers instead of being hand-copied across open_geotiff, dask, GPU, and VRT backends. The goal is to prevent backend drift in nodata masking, attrs, dtype casting, georef validation, and public kwarg validation.
Key Changes
- Add a shared read-finalization helper for eager array backends that handles:
- read metadata validation
- coord/dim construction
- attrs population
- nodata masking and
nodata_pixels_present
- dtype casting and
nodata_dtype_cast
- final
xarray.DataArray construction
- Add a shared lazy-read attrs helper for dask-style backends that centralizes:
_validate_read_geo_info
_populate_attrs_from_geo_info
_set_nodata_attrs
- the dask policy that
nodata_pixels_present stays unset
- Add shared public read-argument validation for:
overview_level
- backend-only kwargs such as
on_gpu_failure, missing_sources, band_nodata, and max_cloud_bytes
- file-like source restrictions for GPU/dask
- Update
open_geotiff to become mostly dispatch orchestration:
- validate args once
- choose VRT/GPU/dask/eager backend
- forward normalized kwargs
- avoid duplicating backend finalization logic
- Update direct public backends (
read_geotiff_dask, read_geotiff_gpu, read_vrt) to call the same relevant validators so direct calls and open_geotiff calls reject the same invalid inputs.
Implementation Notes
- Keep public APIs unchanged.
- Prefer adding private helpers in existing internal modules rather than expanding
__init__.py.
- Do not change current documented dask behavior: lazy outputs should not compute
nodata_pixels_present.
- Treat VRT masking drift as a follow-up bug unless it is cheap to fix during extraction; this refactor's core goal is to make that kind of drift harder to reintroduce.
- Move comments that document policy to the shared helper once; remove duplicated issue-history comments from backend bodies where the helper name/test now carries the intent.
Test Plan
- Add parity tests proving invalid
overview_level values fail through:
open_geotiff
read_geotiff_dask
read_geotiff_gpu
- Add backend parity tests for nodata lifecycle attrs:
- eager numpy
- dask numpy
- GPU where available or skipped
- VRT eager and chunked
- Add dispatch validation tests for ignored-backend kwargs:
max_cloud_bytes with dask/GPU/VRT still raises clearly
missing_sources on non-VRT still raises
band_nodata on non-VRT still raises
on_gpu_failure without GPU still raises
- Run targeted suites:
xrspatial/geotiff/tests/test_nodata_lifecycle_attrs_2135.py
xrspatial/geotiff/tests/test_nodata_semantics_split_1988.py
xrspatial/geotiff/tests/test_georef_status_2136.py
xrspatial/geotiff/tests/test_overview_level_type_validation_2074.py
- backend kwarg/signature parity tests
Assumptions
- This is a refactor-first issue, not a public API change.
- The intended behavior is to preserve current passing semantics except where direct backend entry points currently lack wrapper validation.
- GPU tests may remain optional/skipped when CUDA/CuPy dependencies are unavailable.
Refactor GeoTIFF Read Finalization and Dispatch Validation
Summary
Refactor the GeoTIFF read stack so shared policy lives in shared helpers instead of being hand-copied across
open_geotiff, dask, GPU, and VRT backends. The goal is to prevent backend drift in nodata masking, attrs, dtype casting, georef validation, and public kwarg validation.Key Changes
nodata_pixels_presentnodata_dtype_castxarray.DataArrayconstruction_validate_read_geo_info_populate_attrs_from_geo_info_set_nodata_attrsnodata_pixels_presentstays unsetoverview_levelon_gpu_failure,missing_sources,band_nodata, andmax_cloud_bytesopen_geotiffto become mostly dispatch orchestration:read_geotiff_dask,read_geotiff_gpu,read_vrt) to call the same relevant validators so direct calls andopen_geotiffcalls reject the same invalid inputs.Implementation Notes
__init__.py.nodata_pixels_present.Test Plan
overview_levelvalues fail through:open_geotiffread_geotiff_daskread_geotiff_gpumax_cloud_byteswith dask/GPU/VRT still raises clearlymissing_sourceson non-VRT still raisesband_nodataon non-VRT still raiseson_gpu_failurewithout GPU still raisesxrspatial/geotiff/tests/test_nodata_lifecycle_attrs_2135.pyxrspatial/geotiff/tests/test_nodata_semantics_split_1988.pyxrspatial/geotiff/tests/test_georef_status_2136.pyxrspatial/geotiff/tests/test_overview_level_type_validation_2074.pyAssumptions