geotiff: oracle understands masked-nodata candidates (#1930)#2046
Merged
Conversation
Closes the 5-of-7 xfail shared by every phase-3 backend module. xrspatial reads integer GeoTIFFs whose nodata tag carries an integer sentinel by masking sentinel-equal pixels to NaN and upcasting the array to float (issue #1988). It stamps ``attrs['masked_nodata'] = True`` so a write round-trip can put the tag back. The oracle previously did strict dtype + raw-pixel comparison, so the dtype shift and sentinel-to-NaN rewrite both registered as mismatches. This change adds ``_normalise_for_masked_nodata`` to ``_oracle.py``. When the candidate reports the contract, the rasterio reference is cast to the candidate's float dtype and pixels equal to the sentinel are replaced with NaN; the dtype and pixel assertions then run on directly comparable arrays. Fixtures that do not report the contract pass through unchanged, so the 23+ currently passing fixtures and the citation / JPEG xfails are untouched. Drops the five nodata-masking entries from ``_PARITY_GAPS`` in each of the three on-main backend modules (eager numpy, dask numpy, GPU). The matching ``xfail(strict=True)`` flips to a real failure when the test starts passing, so the oracle change and the cleanup must land together to keep main green. Test coverage in ``golden_corpus/test_oracle.py``: * masked_nodata candidate matches the oracle * masked_nodata flag missing -- strict dtype check still fires * candidate that forgot to mask a sentinel pixel -- pixel check fires * candidate with wrong ``attrs['nodata']`` -- nodata check fires * plain float-NaN fixtures are not affected by the new path * masked_nodata flag plus NaN sentinel passes through cleanly Follow-ups: the same five entries will be removed from the ``_PARITY_GAPS`` tables in the still-open phase-3 PRs #2040 (dask+GPU), #2041 (HTTP), and #2042 (VRT) via separate commits on each branch.
Address PR #2046 review: * Reject fractional sentinels via ``float(nd_float).is_integer()``. Without the guard, a sentinel like ``3.5`` would cast to ``3`` and mask every 3-valued pixel. * Reject sentinels outside the source integer dtype's range via ``info.min <= int(nd_float) <= info.max``. Without the guard, ``np.uint16(-1.0)`` wraps to ``65535`` and would mask the dtype-max pixel. Both guards mirror the upstream xrspatial reader's checks in ``xrspatial/geotiff/__init__.py``, so the oracle's interpretation of the masked-nodata contract is now strictly tighter than what the reader can emit. Added two tests: * ``test_masked_nodata_fractional_sentinel_does_not_mask`` -- builds a fractional-nodata fixture, confirms the oracle stays on the raw-pixel path (dtype mismatch fires, not pixel mismatch). * ``test_masked_nodata_out_of_range_sentinel_does_not_mask`` -- rasterio refuses to write an out-of-range nodata at the writer level, so this calls ``_normalise_for_masked_nodata`` directly with synthesised inputs and confirms the inputs pass through unchanged.
brendancol
added a commit
that referenced
this pull request
May 18, 2026
Follow-up to #2046, which extended the oracle to handle ``attrs['masked_nodata']=True`` candidates. The five ``nodata_int_sentinel_uint16`` / ``stripped_*_uint16`` / ``tiled_*_uint16`` fixtures now pass the oracle directly, so keep ``xfail(strict=True)`` against them would flip to unexpected-pass and break the module. This branch is also rebased onto current main so the new oracle behaviour is what runs.
brendancol
added a commit
that referenced
this pull request
May 18, 2026
* geotiff: golden corpus phase 3 PR 4, dask+GPU backend (#1930) Mirrors the eager / dask / GPU parity layers but reads each fixture through ``open_geotiff(str(path), gpu=True, chunks=32, on_gpu_failure='strict')``, returning a dask-of-cupy DataArray. The oracle pulls pixels via ``.compute()`` then ``.get()`` so the comparison machinery is unchanged. Skips cleanly when no CUDA device is reachable; strict on-gpu-failure keeps a silent CPU fallback from masking dask+GPU coverage. 23 fixtures pass, 7 xfailed (shared codec/attrs gaps), 2 skipped. ``_DASK_GPU_SKIPS`` is reserved for combo-specific gaps and is empty in the first pass. A ``test_dask_gpu_candidate_is_chunked_and_on_device`` belt-and-braces check catches both failure modes (``chunks=`` dropped or ``gpu=True`` silently fallen back) by computing one chunk and asserting it is a ``cupy.ndarray``. * geotiff: strengthen dask+GPU chunked sanity check (#1930) Mirror the phase 3 PR 2 fix: assert the chunk grid is at least 2x2 along the spatial axes after picking a fixture whose pixel extent is at least ``2 * CHUNK_SIZE``. Catches the failure mode where ``chunks=`` is accepted but stitched into a single chunk that covers the whole file (windowing logic never runs). * geotiff: drop nodata-masking gap from dask+GPU module (#1930) Follow-up to #2046, which extended the oracle to handle ``attrs['masked_nodata']=True`` candidates. The five ``nodata_int_sentinel_uint16`` / ``stripped_*_uint16`` / ``tiled_*_uint16`` fixtures now pass the oracle directly, so keep ``xfail(strict=True)`` against them would flip to unexpected-pass and break the module. This branch is also rebased onto current main so the new oracle behaviour is what runs.
3 tasks
brendancol
added a commit
that referenced
this pull request
May 18, 2026
Closes the citation-only CRS parity gap that every phase-3 backend module flagged. The crs_citation_only fixture in the golden corpus carries a user-defined geographic CRS (no EPSG, no WKT in the citation) and the reader was decoding it only into the deprecated geog_citation / datum_code / angular_units attrs. attrs['crs_wkt'] stayed None, so the oracle's _candidate_crs returned None on the candidate side and compare_to_oracle failed. This change adds _synthesize_user_defined_wkt to _geotags.py. When the file declares a user-defined geographic CRS and exposes the ellipsoid (semi_major plus semi_minor or inv_flattening) and the angular-units GeoKeys, the helper feeds those parameters to pyproj.CRS.from_dict and stamps the resulting WKT on attrs['crs_wkt']. The synthesis fails closed (returns None) when pyproj is missing, when there is no semi_major, or when the ellipsoid shape is ambiguous, so the existing deprecated-attrs path stays in place. Projected user-defined CRSes (ModelTypeProjected with ProjectedCSType == 32767) are not yet reconstructible from GeoKeys alone -- they need the GeogPrime / Projection parameters that the corpus does not exercise -- so the helper returns None for that case. Drops the crs_citation_only entry from _PARITY_GAPS in each of the four phase-3 backend modules (eager numpy, dask numpy, GPU, dask+GPU). xfail(strict=True) flips to a real failure when the test starts passing, so the fix and the entry cleanups have to land together. Each module's top-of-file docstring moves the citation entry from "Real parity gaps" to "Resolved gaps", mirroring the masked-nodata fix from PR #2046. Adds unit tests for _synthesize_user_defined_wkt in test_user_defined_crs_wkt_1632.py: the sphere case (the citation fixture shape), the oblate ellipsoid case, the projected-CRS skip path, and the missing-ellipsoid skip path. Adds end-to-end fixture tests in test_oracle.py for attrs['crs_wkt'] stamping, PROJ-dict equality with the rasterio reference, and compare_to_oracle round-trip. After the fix the four phase-3 backend modules report 115 passed, 8 skipped, 4 xfailed (the remaining JPEG axis-order gap). The full golden_corpus / corpus suite reports 253 passed, 8 skipped, 4 xfailed with no XPASS.
brendancol
added a commit
that referenced
this pull request
May 18, 2026
…2054) * geotiff: synthesize WKT for citation-only user-defined CRSes (#1930) Closes the citation-only CRS parity gap that every phase-3 backend module flagged. The crs_citation_only fixture in the golden corpus carries a user-defined geographic CRS (no EPSG, no WKT in the citation) and the reader was decoding it only into the deprecated geog_citation / datum_code / angular_units attrs. attrs['crs_wkt'] stayed None, so the oracle's _candidate_crs returned None on the candidate side and compare_to_oracle failed. This change adds _synthesize_user_defined_wkt to _geotags.py. When the file declares a user-defined geographic CRS and exposes the ellipsoid (semi_major plus semi_minor or inv_flattening) and the angular-units GeoKeys, the helper feeds those parameters to pyproj.CRS.from_dict and stamps the resulting WKT on attrs['crs_wkt']. The synthesis fails closed (returns None) when pyproj is missing, when there is no semi_major, or when the ellipsoid shape is ambiguous, so the existing deprecated-attrs path stays in place. Projected user-defined CRSes (ModelTypeProjected with ProjectedCSType == 32767) are not yet reconstructible from GeoKeys alone -- they need the GeogPrime / Projection parameters that the corpus does not exercise -- so the helper returns None for that case. Drops the crs_citation_only entry from _PARITY_GAPS in each of the four phase-3 backend modules (eager numpy, dask numpy, GPU, dask+GPU). xfail(strict=True) flips to a real failure when the test starts passing, so the fix and the entry cleanups have to land together. Each module's top-of-file docstring moves the citation entry from "Real parity gaps" to "Resolved gaps", mirroring the masked-nodata fix from PR #2046. Adds unit tests for _synthesize_user_defined_wkt in test_user_defined_crs_wkt_1632.py: the sphere case (the citation fixture shape), the oblate ellipsoid case, the projected-CRS skip path, and the missing-ellipsoid skip path. Adds end-to-end fixture tests in test_oracle.py for attrs['crs_wkt'] stamping, PROJ-dict equality with the rasterio reference, and compare_to_oracle round-trip. After the fix the four phase-3 backend modules report 115 passed, 8 skipped, 4 xfailed (the remaining JPEG axis-order gap). The full golden_corpus / corpus suite reports 253 passed, 8 skipped, 4 xfailed with no XPASS. * Address review nits: named constant, drop unused params (#1930) * Adds ``GEOKEY_GEOG_SEMI_MINOR_AXIS = 2058`` next to the other GeoKey ID constants and uses it in ``extract_geo_info`` instead of the raw int literal. Matches the existing convention for the semi-major-axis and inv-flattening keys. * Drops ``angular_units_code`` and ``geog_citation`` from ``_synthesize_user_defined_wkt``. The function never read them. PROJ's ``longlat`` always emits degrees, and the corpus has no radian-unit user-defined fixture; the docstring now calls that limit out so a future radian fixture forces a deliberate signature change rather than silently degrading the units. * Expands the docstring's "not handled" section to call out ``MODEL_TYPE_GEOCENTRIC`` (3) and unknown / zero model_type. Adds a ``test_synthesize_user_defined_wkt_geocentric_returns_none`` pin so a future change that promotes geocentric to a real proj_dict has to touch the test deliberately. Full corpus suite still reports 287 passed, 8 skipped, 4 xfailed (JPEG axis-order gap), 0 XPASS.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes 5 of the 7
xfail(strict=True)cells shared by every phase-3 backend module of the golden corpus plan in #1930.xrspatial reads integer GeoTIFFs whose nodata tag carries an integer sentinel by masking sentinel-equal pixels to NaN and upcasting the array to float (see #1988). It stamps
attrs['masked_nodata'] = Trueso a write round-trip can put the tag back. The oracle previously did strict dtype + raw-pixel comparison, so the dtype shift and the sentinel-to-NaN rewrite both registered as mismatches.This PR adds
_normalise_for_masked_nodatato_oracle.py. When the candidate reports the masked-nodata contract, the rasterio reference is cast to the candidate's float dtype and any pixel equal to the integer sentinel is replaced with NaN. The dtype and pixel assertions then run on directly comparable arrays. Fixtures that do not report the contract pass through unchanged, so the 23 currently passing fixtures per backend and the remaining citation / JPEG xfails are untouched.The five nodata entries are dropped from the
_PARITY_GAPStables in the three on-main backend modules (eager numpy, dask numpy, GPU).xfail(strict=True)flips to a hard failure when a test starts passing, so the oracle change and the per-module cleanup must land together.Result: 212 corpus tests pass, 6 xfailed (down from 21), 6 skipped.
What this PR does not do
The three other still-open phase-3 PRs each carry the same five entries in their own
_PARITY_GAPS:After this PR merges, a one-line follow-up commit per branch removes the entries. Those follow-ups are not part of this rockout.
Test plan
pytest xrspatial/geotiff/tests/golden_corpus/test_oracle.py: 30 passed (24 existing + 6 new for the masked-nodata path)pytest xrspatial/geotiff/tests/test_golden_corpus_*_1930.py xrspatial/geotiff/tests/golden_corpus/: 212 passed, 6 xfailed, 6 skippedRefs #1930. Builds on #1988 (split nodata contract, already merged).