Skip to content

Refactor GeoTIFF read finalization and dispatch validation to prevent backend drift #2162

@brendancol

Description

@brendancol

Refactor GeoTIFF Read Finalization and Dispatch Validation

Summary

Refactor the GeoTIFF read stack so shared policy lives in shared helpers instead of being hand-copied across open_geotiff, dask, GPU, and VRT backends. The goal is to prevent backend drift in nodata masking, attrs, dtype casting, georef validation, and public kwarg validation.

Key Changes

  • Add a shared read-finalization helper for eager array backends that handles:
    • read metadata validation
    • coord/dim construction
    • attrs population
    • nodata masking and nodata_pixels_present
    • dtype casting and nodata_dtype_cast
    • final xarray.DataArray construction
  • Add a shared lazy-read attrs helper for dask-style backends that centralizes:
    • _validate_read_geo_info
    • _populate_attrs_from_geo_info
    • _set_nodata_attrs
    • the dask policy that nodata_pixels_present stays unset
  • Add shared public read-argument validation for:
    • overview_level
    • backend-only kwargs such as on_gpu_failure, missing_sources, band_nodata, and max_cloud_bytes
    • file-like source restrictions for GPU/dask
  • Update open_geotiff to become mostly dispatch orchestration:
    • validate args once
    • choose VRT/GPU/dask/eager backend
    • forward normalized kwargs
    • avoid duplicating backend finalization logic
  • Update direct public backends (read_geotiff_dask, read_geotiff_gpu, read_vrt) to call the same relevant validators so direct calls and open_geotiff calls reject the same invalid inputs.

Implementation Notes

  • Keep public APIs unchanged.
  • Prefer adding private helpers in existing internal modules rather than expanding __init__.py.
  • Do not change current documented dask behavior: lazy outputs should not compute nodata_pixels_present.
  • Treat VRT masking drift as a follow-up bug unless it is cheap to fix during extraction; this refactor's core goal is to make that kind of drift harder to reintroduce.
  • Move comments that document policy to the shared helper once; remove duplicated issue-history comments from backend bodies where the helper name/test now carries the intent.

Test Plan

  • Add parity tests proving invalid overview_level values fail through:
    • open_geotiff
    • read_geotiff_dask
    • read_geotiff_gpu
  • Add backend parity tests for nodata lifecycle attrs:
    • eager numpy
    • dask numpy
    • GPU where available or skipped
    • VRT eager and chunked
  • Add dispatch validation tests for ignored-backend kwargs:
    • max_cloud_bytes with dask/GPU/VRT still raises clearly
    • missing_sources on non-VRT still raises
    • band_nodata on non-VRT still raises
    • on_gpu_failure without GPU still raises
  • Run targeted suites:
    • xrspatial/geotiff/tests/test_nodata_lifecycle_attrs_2135.py
    • xrspatial/geotiff/tests/test_nodata_semantics_split_1988.py
    • xrspatial/geotiff/tests/test_georef_status_2136.py
    • xrspatial/geotiff/tests/test_overview_level_type_validation_2074.py
    • backend kwarg/signature parity tests

Assumptions

  • This is a refactor-first issue, not a public API change.
  • The intended behavior is to preserve current passing semantics except where direct backend entry points currently lack wrapper validation.
  • GPU tests may remain optional/skipped when CUDA/CuPy dependencies are unavailable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions