Skip to content

Commit 339df70

Browse files
authored
geotiff: synthesize WKT for citation-only user-defined CRSes (#1930) (#2054)
* geotiff: synthesize WKT for citation-only user-defined CRSes (#1930) Closes the citation-only CRS parity gap that every phase-3 backend module flagged. The crs_citation_only fixture in the golden corpus carries a user-defined geographic CRS (no EPSG, no WKT in the citation) and the reader was decoding it only into the deprecated geog_citation / datum_code / angular_units attrs. attrs['crs_wkt'] stayed None, so the oracle's _candidate_crs returned None on the candidate side and compare_to_oracle failed. This change adds _synthesize_user_defined_wkt to _geotags.py. When the file declares a user-defined geographic CRS and exposes the ellipsoid (semi_major plus semi_minor or inv_flattening) and the angular-units GeoKeys, the helper feeds those parameters to pyproj.CRS.from_dict and stamps the resulting WKT on attrs['crs_wkt']. The synthesis fails closed (returns None) when pyproj is missing, when there is no semi_major, or when the ellipsoid shape is ambiguous, so the existing deprecated-attrs path stays in place. Projected user-defined CRSes (ModelTypeProjected with ProjectedCSType == 32767) are not yet reconstructible from GeoKeys alone -- they need the GeogPrime / Projection parameters that the corpus does not exercise -- so the helper returns None for that case. Drops the crs_citation_only entry from _PARITY_GAPS in each of the four phase-3 backend modules (eager numpy, dask numpy, GPU, dask+GPU). xfail(strict=True) flips to a real failure when the test starts passing, so the fix and the entry cleanups have to land together. Each module's top-of-file docstring moves the citation entry from "Real parity gaps" to "Resolved gaps", mirroring the masked-nodata fix from PR #2046. Adds unit tests for _synthesize_user_defined_wkt in test_user_defined_crs_wkt_1632.py: the sphere case (the citation fixture shape), the oblate ellipsoid case, the projected-CRS skip path, and the missing-ellipsoid skip path. Adds end-to-end fixture tests in test_oracle.py for attrs['crs_wkt'] stamping, PROJ-dict equality with the rasterio reference, and compare_to_oracle round-trip. After the fix the four phase-3 backend modules report 115 passed, 8 skipped, 4 xfailed (the remaining JPEG axis-order gap). The full golden_corpus / corpus suite reports 253 passed, 8 skipped, 4 xfailed with no XPASS. * Address review nits: named constant, drop unused params (#1930) * Adds ``GEOKEY_GEOG_SEMI_MINOR_AXIS = 2058`` next to the other GeoKey ID constants and uses it in ``extract_geo_info`` instead of the raw int literal. Matches the existing convention for the semi-major-axis and inv-flattening keys. * Drops ``angular_units_code`` and ``geog_citation`` from ``_synthesize_user_defined_wkt``. The function never read them. PROJ's ``longlat`` always emits degrees, and the corpus has no radian-unit user-defined fixture; the docstring now calls that limit out so a future radian fixture forces a deliberate signature change rather than silently degrading the units. * Expands the docstring's "not handled" section to call out ``MODEL_TYPE_GEOCENTRIC`` (3) and unknown / zero model_type. Adds a ``test_synthesize_user_defined_wkt_geocentric_returns_none`` pin so a future change that promotes geocentric to a real proj_dict has to touch the test deliberately. Full corpus suite still reports 287 passed, 8 skipped, 4 xfailed (JPEG axis-order gap), 0 XPASS.
1 parent e1bcc3d commit 339df70

7 files changed

Lines changed: 358 additions & 51 deletions

xrspatial/geotiff/_geotags.py

Lines changed: 110 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -67,6 +67,7 @@
6767
GEOKEY_GEOG_LINEAR_UNITS = 2052
6868
GEOKEY_GEOG_ANGULAR_UNITS = 2054
6969
GEOKEY_GEOG_SEMI_MAJOR_AXIS = 2057
70+
GEOKEY_GEOG_SEMI_MINOR_AXIS = 2058
7071
GEOKEY_GEOG_INV_FLATTENING = 2059
7172
GEOKEY_PROJECTED_CS_TYPE = 3072
7273
GEOKEY_PROJ_CITATION = 3073
@@ -308,6 +309,92 @@ def _looks_like_wkt(text) -> bool:
308309
return any(head.startswith(p) for p in _WKT_PREFIXES)
309310

310311

312+
def _synthesize_user_defined_wkt(
313+
model_type: int,
314+
semi_major: float | None,
315+
semi_minor: float | None,
316+
inv_flattening: float | None,
317+
) -> str | None:
318+
"""Build a canonical WKT for a user-defined geographic CRS.
319+
320+
Triggered from :func:`extract_geo_info` when the file declares a
321+
user-defined geographic CRS (``GeographicTypeGeoKey == 32767`` and
322+
no projected EPSG) and the citation does not already carry a WKT
323+
string. The classic example is a GeoTIFF whose CRS lives only in
324+
``GeogCitationGeoKey`` as a free-form name, with the ellipsoid
325+
exposed via separate GeoKeys (``GeogSemiMajorAxisGeoKey``,
326+
``GeogSemiMinorAxisGeoKey``, ``GeogInvFlatteningGeoKey``).
327+
Rasterio reads such files and reports a GEOGCS CRS; pre-#1930
328+
xrspatial dropped the projection silently because nothing stamped
329+
a canonical ``attrs['crs_wkt']``. See issue #1930 (Phase 3
330+
``crs_citation_only`` parity gap).
331+
332+
Returns a WKT string when pyproj is installed and the GeoKeys
333+
expose enough of the ellipsoid to feed ``pyproj.CRS.from_dict``,
334+
otherwise ``None``. The synthesized CRS is name-stripped (PROJ
335+
drops the GEOGCS name on ``to_dict()`` anyway), so callers that
336+
need the user-supplied name should still consult
337+
``attrs['geog_citation']`` -- this helper exists to close the
338+
canonical-CRS parity gap, not to round-trip the citation field.
339+
340+
Angular units are not threaded through. PROJ's ``longlat`` always
341+
emits degrees, and the corpus has no radian-unit user-defined
342+
fixture; if one shows up, this helper needs an angular_units_code
343+
parameter and the proj_dict needs a ``to_meter`` / units shim.
344+
345+
Non-geographic model types fall through to ``None``:
346+
347+
* ``MODEL_TYPE_PROJECTED`` (1) user-defined CRSes (``ProjectedCSType
348+
== 32767``) need the GeogPrime / Projection parameters that this
349+
helper does not yet read.
350+
* ``MODEL_TYPE_GEOCENTRIC`` (3) and unknown / zero model types are
351+
not exercised by the corpus and would need their own per-type
352+
proj_dict construction.
353+
354+
In all three cases the caller falls through to the existing
355+
deprecated-attrs path.
356+
"""
357+
if model_type != MODEL_TYPE_GEOGRAPHIC:
358+
return None
359+
if semi_major is None:
360+
return None
361+
# No ellipsoid: a degenerate file. Refuse to fabricate one rather
362+
# than emit a misleading WGS84 fallback.
363+
if semi_minor is None and inv_flattening is None:
364+
return None
365+
try:
366+
from pyproj import CRS
367+
except ImportError:
368+
return None
369+
370+
proj_dict: dict[str, object] = {'proj': 'longlat', 'no_defs': True}
371+
proj_dict['a'] = float(semi_major)
372+
if inv_flattening is not None and float(inv_flattening) != 0.0:
373+
proj_dict['rf'] = float(inv_flattening)
374+
elif semi_minor is not None:
375+
proj_dict['b'] = float(semi_minor)
376+
else:
377+
# inv_flattening == 0 is the GeoTIFF convention for a sphere
378+
# (b == a). Encode it as such for PROJ.
379+
proj_dict['b'] = float(semi_major)
380+
381+
try:
382+
crs = CRS.from_dict(proj_dict)
383+
return crs.to_wkt()
384+
except Exception as exc:
385+
import warnings
386+
from . import _geotiff_strict_mode, GeoTIFFFallbackWarning
387+
if _geotiff_strict_mode():
388+
raise
389+
warnings.warn(
390+
f"_synthesize_user_defined_wkt failed "
391+
f"({type(exc).__name__}: {exc}); returning None.",
392+
GeoTIFFFallbackWarning,
393+
stacklevel=2,
394+
)
395+
return None
396+
397+
311398
def _parse_geokeys(ifd: IFD, data: bytes | memoryview,
312399
byte_order: str) -> dict[int, int | float | str]:
313400
"""Parse the GeoKeyDirectory and resolve values from param tags.
@@ -609,6 +696,12 @@ def extract_geo_info(ifd: IFD, data: bytes | memoryview,
609696
semi_major = geokeys.get(GEOKEY_GEOG_SEMI_MAJOR_AXIS)
610697
if not isinstance(semi_major, (int, float)):
611698
semi_major = None
699+
# GeogSemiMinorAxisGeoKey. Kept local for WKT synthesis below; not
700+
# surfaced as a separate attr because the canonical ellipsoid lives
701+
# in attrs['crs_wkt'].
702+
semi_minor = geokeys.get(GEOKEY_GEOG_SEMI_MINOR_AXIS)
703+
if not isinstance(semi_minor, (int, float)):
704+
semi_minor = None
612705
inv_flat = geokeys.get(GEOKEY_GEOG_INV_FLATTENING)
613706
if not isinstance(inv_flat, (int, float)):
614707
inv_flat = None
@@ -728,6 +821,23 @@ def extract_geo_info(ifd: IFD, data: bytes | memoryview,
728821
# to_geotiff only consults attrs['crs'] / attrs['crs_wkt']. See
729822
# issue #1632.
730823
crs_wkt = crs_name
824+
else:
825+
# Citation-only user-defined geographic CRS: no EPSG, no WKT
826+
# in the citation, but enough ellipsoid / units GeoKeys to
827+
# synthesize a canonical WKT (issue #1930 Phase 3). Without
828+
# this branch ``attrs['crs_wkt']`` stays None and the
829+
# golden-corpus oracle reports a CRS parity gap.
830+
synth = _synthesize_user_defined_wkt(
831+
model_type=(
832+
int(model_type)
833+
if isinstance(model_type, (int, float)) else 0
834+
),
835+
semi_major=semi_major,
836+
semi_minor=semi_minor,
837+
inv_flattening=inv_flat,
838+
)
839+
if synth is not None:
840+
crs_wkt = synth
731841

732842
return GeoInfo(
733843
transform=transform,

xrspatial/geotiff/tests/golden_corpus/test_oracle.py

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -517,6 +517,64 @@ def test_crs_citation_only_fixture_rejects_unrelated_crs() -> None:
517517
compare_to_oracle(path, cand)
518518

519519

520+
def test_crs_citation_only_open_geotiff_stamps_canonical_wkt() -> None:
521+
"""``open_geotiff`` on the citation fixture stamps ``attrs['crs_wkt']``.
522+
523+
The user-defined geographic CRS in ``crs_citation_only.tif`` has no
524+
EPSG code and no WKT in the citation; only the ellipsoid radius,
525+
inverse flattening, and angular-units GeoKeys are populated. The
526+
reader synthesizes a canonical WKT from those parameters via
527+
:func:`xrspatial.geotiff._geotags._synthesize_user_defined_wkt`,
528+
which closes the Phase 3 ``crs_citation_only`` parity gap.
529+
530+
Pinned here so any future refactor that drops the synthesis branch
531+
re-opens the gap visibly. The companion oracle test
532+
(``test_crs_citation_only_xrspatial_round_trips_through_oracle``)
533+
drives the same fixture through ``compare_to_oracle`` to confirm the
534+
stamp lands somewhere the parity check accepts.
535+
"""
536+
from xrspatial.geotiff import open_geotiff
537+
538+
path = _CRS_FIXTURE_DIR / 'crs_citation_only.tif'
539+
da = open_geotiff(str(path))
540+
541+
# The fixture has no EPSG, so the canonical EPSG attr stays absent.
542+
assert da.attrs.get('crs') is None
543+
wkt = da.attrs.get('crs_wkt')
544+
assert isinstance(wkt, str) and wkt, (
545+
f"open_geotiff must stamp a non-empty crs_wkt on the "
546+
f"citation-only fixture; got {wkt!r}"
547+
)
548+
# PROJ structural sanity: the synthesized WKT must parse back
549+
# through rasterio's CRS.from_wkt (which delegates to PROJ).
550+
parsed = rasterio.crs.CRS.from_wkt(wkt)
551+
assert parsed is not None
552+
# And it must share PROJ-dict shape with the rasterio reference.
553+
with rasterio.open(path) as src:
554+
ref = src.crs
555+
assert parsed.to_dict() == ref.to_dict(), (
556+
f"synthesized WKT must round-trip to the same PROJ dict as "
557+
f"the rasterio reference; got {parsed.to_dict()} vs "
558+
f"{ref.to_dict()}"
559+
)
560+
561+
562+
def test_crs_citation_only_xrspatial_round_trips_through_oracle() -> None:
563+
"""``compare_to_oracle`` accepts the xrspatial-stamped citation CRS.
564+
565+
Drives the citation fixture through ``open_geotiff`` (the exact
566+
code path the Phase 3 backend parametrizations use) and runs the
567+
result through ``compare_to_oracle``. This is the end-to-end
568+
parity check that flips the corpus from xfail to pass once
569+
``_synthesize_user_defined_wkt`` is wired in.
570+
"""
571+
from xrspatial.geotiff import open_geotiff
572+
573+
path = _CRS_FIXTURE_DIR / 'crs_citation_only.tif'
574+
da = open_geotiff(str(path))
575+
compare_to_oracle(path, da)
576+
577+
520578
def test_crs_wkt_utm10n_fixture_accepts_wkt_attr() -> None:
521579
"""``crs_wkt_utm10n`` also accepts a candidate that carries crs_wkt.
522580

xrspatial/geotiff/tests/test_golden_corpus_dask_gpu_1930.py

Lines changed: 15 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,14 @@
1313
the eager / dask / GPU modules. ``_DASK_GPU_SKIPS`` is reserved for
1414
gaps that surface only when chunked GPU reads stitch through nvCOMP
1515
plus dask; it starts empty.
16+
17+
Resolved gaps (no longer xfail):
18+
19+
* Integer nodata masking -- closed by the oracle's
20+
``_normalise_for_masked_nodata`` helper.
21+
* ``crs_citation_only`` -- closed by ``_synthesize_user_defined_wkt``
22+
in ``_geotags.py``, which stamps a canonical
23+
``attrs['crs_wkt']`` on user-defined geographic CRSes.
1624
"""
1725
from __future__ import annotations
1826

@@ -53,15 +61,13 @@
5361
# Integer-nodata masking used to live here too; the oracle's
5462
# _normalise_for_masked_nodata helper (#2046) closes that gap so it is
5563
# no longer xfailed on any backend. The multi-band axis-order gap for
56-
# the JPEG-YCbCr fixture is also closed (see ``_normalise_axis_order``
57-
# in ``_oracle.py``).
58-
_PARITY_GAPS: dict[str, str] = {
59-
"crs_citation_only": (
60-
"citation-only CRS: xrspatial decodes the citation into deprecated "
61-
"attrs['geog_citation'] but does not emit a canonical attrs['crs'] "
62-
"or attrs['crs_wkt']. Real parity gap; needs a fix in _crs.py."
63-
),
64-
}
64+
# the JPEG-YCbCr fixture is also closed on the eager / dask paths
65+
# (see ``_normalise_axis_order`` in ``_oracle.py``); on the dask+GPU
66+
# pipeline the decode error wins first so the JPEG fixture lives in
67+
# ``_INTENTIONAL_SKIPS`` below. crs_citation_only moved off this list
68+
# once ``_synthesize_user_defined_wkt`` started stamping a canonical
69+
# WKT for user-defined geographic CRSes.
70+
_PARITY_GAPS: dict[str, str] = {}
6571

6672
# Empty: failures unique to the dask+GPU combo (eager / dask / pure-
6773
# GPU all pass) would land here. Kept around as the documented home

xrspatial/geotiff/tests/test_golden_corpus_dask_numpy_1930.py

Lines changed: 19 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -10,12 +10,20 @@
1010
shows up here.
1111
1212
The skip / xfail taxonomy is intentionally identical to the eager
13-
module's. The parity gaps that motivated them (integer-nodata masking,
14-
citation CRS encoding, RGB band axis order, MinIsWhite inversion) live
15-
in the codec / attrs layer, which both backends share. If a future
16-
parity gap turns out to be dask-specific, add it to ``_DASK_SKIPS``
17-
rather than ``_PARITY_GAPS`` so the difference between "shared
18-
codec/attrs gap" and "dask plumbing gap" stays legible.
13+
module's. The parity gaps that motivated them (RGB band axis order,
14+
MinIsWhite inversion) live in the codec / attrs layer, which both
15+
backends share. If a future parity gap turns out to be dask-specific,
16+
add it to ``_DASK_SKIPS`` rather than ``_PARITY_GAPS`` so the
17+
difference between "shared codec/attrs gap" and "dask plumbing gap"
18+
stays legible.
19+
20+
Resolved gaps (no longer xfail):
21+
22+
* Integer nodata masking -- closed by the oracle's
23+
``_normalise_for_masked_nodata`` helper.
24+
* ``crs_citation_only`` -- closed by ``_synthesize_user_defined_wkt``
25+
in ``_geotags.py``, which stamps a canonical
26+
``attrs['crs_wkt']`` on user-defined geographic CRSes.
1927
2028
The chunk size is fixed at 32. Most corpus fixtures are 64x64 or
2129
smaller, so 32 produces either a 2x2 chunk grid or a single chunk;
@@ -52,14 +60,11 @@
5260
# oracle's _normalise_for_masked_nodata helper closes that gap so it
5361
# is no longer xfailed on any backend. The multi-band axis-order gap
5462
# for the JPEG-YCbCr fixture is also closed (see
55-
# ``_normalise_axis_order`` in ``_oracle.py``).
56-
_PARITY_GAPS: dict[str, str] = {
57-
"crs_citation_only": (
58-
"citation-only CRS: xrspatial decodes the citation into deprecated "
59-
"attrs['geog_citation'] but does not emit a canonical attrs['crs'] "
60-
"or attrs['crs_wkt']. Real parity gap; needs a fix in _crs.py."
61-
),
62-
}
63+
# ``_normalise_axis_order`` in ``_oracle.py``). The citation-only CRS
64+
# gap is closed by ``_synthesize_user_defined_wkt`` in ``_geotags.py``,
65+
# which stamps a canonical WKT for user-defined geographic CRSes onto
66+
# ``attrs['crs_wkt']``.
67+
_PARITY_GAPS: dict[str, str] = {}
6368

6469
# Dask-only gaps go here. Empty in the first pass; add an entry only when
6570
# a fixture is dask-specific (i.e. eager passes, dask does not).

xrspatial/geotiff/tests/test_golden_corpus_eager_numpy_1930.py

Lines changed: 10 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -24,10 +24,9 @@
2424
2525
Real parity gaps (``xfail``):
2626
27-
* ``crs_citation_only`` -- xrspatial decodes the citation into the
28-
deprecated ``attrs['geog_citation']`` but does not emit a canonical
29-
``attrs['crs']`` or ``attrs['crs_wkt']``. Real parity gap; needs a fix
30-
in ``_crs.py`` to round-trip citation WKT.
27+
None today. Every parity gap the corpus has surfaced so far has either
28+
been fixed (see Resolved gaps below) or moved to a per-backend skip
29+
table because it is not a shared decode/attrs gap.
3130
3231
Resolved gaps (no longer xfail):
3332
@@ -43,6 +42,12 @@
4342
The oracle's ``_normalise_axis_order`` helper now transposes the
4443
trailing-band candidate to leading-band before the shape and pixel
4544
checks run, so this fixture passes directly.
45+
* ``crs_citation_only`` -- the GeoTIFF reader now synthesizes a
46+
canonical WKT for user-defined geographic CRSes from the ellipsoid
47+
/ units GeoKeys and stamps it on ``attrs['crs_wkt']`` (see
48+
``_synthesize_user_defined_wkt`` in ``_geotags.py``). The oracle's
49+
``_candidate_crs`` picks it up via the WKT branch, and ``_crs_equal``
50+
compares the PROJ dicts because neither side has an EPSG code.
4651
4752
Intentional skip (``skip``):
4853
@@ -78,13 +83,7 @@
7883
)
7984

8085

81-
_PARITY_GAPS: dict[str, str] = {
82-
"crs_citation_only": (
83-
"citation-only CRS: xrspatial decodes the citation into deprecated "
84-
"attrs['geog_citation'] but does not emit a canonical attrs['crs'] "
85-
"or attrs['crs_wkt']. Real parity gap; needs a fix in _crs.py."
86-
),
87-
}
86+
_PARITY_GAPS: dict[str, str] = {}
8887

8988
_INTENTIONAL_SKIPS: dict[str, str] = {
9089
"nodata_miniswhite_uint8": (

xrspatial/geotiff/tests/test_golden_corpus_gpu_1930.py

Lines changed: 20 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -11,13 +11,22 @@
1111
this module; runs with a GPU exercise every fixture the eager and dask
1212
backends already do.
1313
14-
``_PARITY_GAPS`` carries over the codec / attrs gaps that all three
15-
backends share (citation CRS, integer nodata masking). ``_GPU_SKIPS``
16-
holds GPU-only failures, currently the JPEG-YCbCr fixture: the GPU
17-
decoder does not handle it and ``on_gpu_failure='strict'`` raises
18-
rather than falling back, so the read fails before the oracle can
19-
compare. On eager / dask the same fixture exposes the RGB axis-order
20-
divergence; on GPU strict mode it never gets that far.
14+
``_PARITY_GAPS`` is empty for the GPU backend now that integer nodata
15+
masking and the citation-only CRS gap are closed at the codec / attrs
16+
layer. ``_GPU_SKIPS`` holds GPU-only failures, currently the
17+
JPEG-YCbCr fixture: the GPU decoder does not handle it and
18+
``on_gpu_failure='strict'`` raises rather than falling back, so the
19+
read fails before the oracle can compare. On eager / dask the same
20+
fixture exposes the RGB axis-order divergence; on GPU strict mode it
21+
never gets that far.
22+
23+
Resolved gaps (no longer xfail):
24+
25+
* Integer nodata masking -- closed by the oracle's
26+
``_normalise_for_masked_nodata`` helper.
27+
* ``crs_citation_only`` -- closed by ``_synthesize_user_defined_wkt``
28+
in ``_geotags.py``, which stamps a canonical
29+
``attrs['crs_wkt']`` on user-defined geographic CRSes.
2130
2231
The GPU read is configured with ``on_gpu_failure='strict'`` so a codec
2332
that would silently CPU-fall-back instead surfaces as an xfail / fail
@@ -56,16 +65,10 @@
5665
)
5766

5867

59-
# Integer-nodata masking used to live here too; the oracle's
60-
# _normalise_for_masked_nodata helper now closes that gap so it is no
61-
# longer xfailed on any backend.
62-
_PARITY_GAPS: dict[str, str] = {
63-
"crs_citation_only": (
64-
"citation-only CRS: xrspatial decodes the citation into deprecated "
65-
"attrs['geog_citation'] but does not emit a canonical attrs['crs'] "
66-
"or attrs['crs_wkt']. Real parity gap; needs a fix in _crs.py."
67-
),
68-
}
68+
# Integer-nodata masking used to live here (closed by the oracle's
69+
# _normalise_for_masked_nodata helper), and crs_citation_only used to
70+
# live here too (closed by _synthesize_user_defined_wkt in _geotags.py).
71+
_PARITY_GAPS: dict[str, str] = {}
6972

7073
# GPU-only gaps. Empty in the current corpus: failures here would be
7174
# fixtures the eager and dask backends decode cleanly but the GPU

0 commit comments

Comments
 (0)