Skip to content

Improve robustness in match_candidate_items_to_windows when re-projection is needed#541

Open
favyen2 wants to merge 5 commits intomasterfrom
favyen/20260302-reprojection-robustness
Open

Improve robustness in match_candidate_items_to_windows when re-projection is needed#541
favyen2 wants to merge 5 commits intomasterfrom
favyen/20260302-reprojection-robustness

Conversation

@favyen2
Copy link
Collaborator

@favyen2 favyen2 commented Mar 3, 2026

  • Refactor safely_reproject_within_valid_area and use it when matching items to windows.
  • Call shapely.make_valid in STGeometry.to_projection to handle edge cases.

…tion is needed.

- Refactor safely_reproject_within_valid_area and use it when matching items to windows.
- Call shapely.make_valid in STGeometry.to_projection to handle edge cases.
@favyen2
Copy link
Collaborator Author

favyen2 commented Mar 3, 2026

@claude can you review this PR?

@claude
Copy link

claude bot commented Mar 3, 2026

Claude finished @favyen2's task in 2m 30s —— View job


PR Review

  • Read CLAUDE.md and gather context
  • Review changed files
  • Provide comprehensive review feedback

Overall this is a well-motivated PR. The core idea — clip in WGS84 first, then reproject — is more robust than the old approach of either naively reprojecting large geometries or using the is_too_large() heuristic. The make_valid fix in to_projection is a clean defensive measure. Below are my findings.


Correctness

1. Behavioral change in vector_format.py: geometries no longer clipped to requested bounds

The old safely_reproject_and_clip returned geometries clipped to the exact destination bounds. The new function clips to a buffered WGS84 region (1 degree padding), so returned Feature geometries can extend outside the queried bounds.

In vector_format.py:358-369, only an intersects check is done — the returned Feature still has the full clipped-but-not-trimmed geometry:

# reproj may extend ~1 degree outside bounds
reprojected_features.append(Feature(reproj, feat.properties))

If downstream callers expect feature geometries to lie within the queried bounds (e.g., for rendering or spatial joins), this could silently produce wrong results. The removed test_clipping test was the only test verifying this contract. Consider either:

  • Clipping reproj.shp to dst_geom.shp before creating the Feature, or
  • Documenting that the returned geometries may extend beyond bounds Fix this →

2. safely_reproject_within_valid_area same-projection short-circuit doesn't check intersection

geometry.py:629-631:

if src_geom.projection == valid_geom.projection:
    results.append(src_geom)   # returned even if it doesn't intersect valid_geom
    continue

The docstring says "Returns None for source geometries that don't intersect valid_geom in WGS84" — but the same-projection case skips this entirely and always returns src_geom. This is consistent with how the callers handle it (they do their own intersection check afterwards), but the docstring is misleading. Fix this →


Minor Issues

3. intersects() calls CRS.from_epsg(WGS84_EPSG) instead of reusing WGS84_PROJECTION.crs

geometry.py:316,319:

if self.projection.crs != CRS.from_epsg(WGS84_EPSG):   # allocates a new CRS object
    ...
if other.projection.crs != CRS.from_epsg(WGS84_EPSG):  # same

_to_wgs84 in the same file already uses WGS84_PROJECTION.crs for this comparison. Since intersects() can be called frequently, this is a small but unnecessary allocation per call. Fix this →

4. WGS84_CLIP_BUFFER_DEGREES = 1.0 is unexplained

A comment explaining why 1 degree was chosen (and any known edge cases where it might be too small, such as features near projection zone boundaries) would help future readers. Fix this →

5. Weaker assertions in test_antimeridian_handling_intersecting

The old test verified the exact reprojected bounds; the new test only checks area > 0 and intersects. This is understandable given the new behavior (no exact clipping), but adding at least a rough bounds check (e.g., result lies within dst_geom ± some epsilon) would make the test more meaningful.


Positive Notes

  • Removing is_too_large() / MAX_GEOMETRY_DEGREES and the forced-global-geometry fallback is a good cleanup; the new approach is more principled.
  • The make_valid guard in to_projection is a nice defensive fix with a clear, accurate comment.
  • intersects() now always going through WGS84 when projections differ is cleaner and correctly handles the antimeridian uniformly.
  • The new get_clip_region() correctly handles MultiPolygon from antimeridian-splitting by buffering each part separately — good attention to detail.
  • Test coverage is solid, including a real-world CDSE geometry fixture with near-duplicate vertices.

favyen2 added 3 commits March 2, 2026 16:29
Many data sources re-project to WGS84 for initial spatiotemporal intersection,
previously they didn't handle antimeridian splitting but we add STGeometry.to_wgs84
function that they can call to handle it.
@favyen2
Copy link
Collaborator Author

favyen2 commented Mar 3, 2026

@claude can you review this PR?

@claude
Copy link

claude bot commented Mar 3, 2026

Claude finished @favyen2's task in 2m 46s —— View job


PR Review

  • Read CLAUDE.md and gather context
  • Review changed files
  • Provide comprehensive review feedback

This is a solid, well-motivated PR. The core strategy — clip in WGS84 to a buffered window before reprojecting — is more principled than both the old direct-reproject approach and the is_too_large() heuristic. The make_valid fix and the new to_wgs84() method are clean improvements. Below are the findings, grouped by severity.


Correctness Issues

1. vector_format.py: returned Feature geometries no longer clipped to bounds

The old safely_reproject_and_clip explicitly clipped the result to dst_geom.shp (confirmed by the removed test_clipping test). The new safely_reproject_within_valid_area clips to a buffered WGS84 region (±1 degree), so features returned from GeojsonVectorFormat.decode_vector can extend several km beyond the requested bounds:

# vector_format.py:362 — only intersection checked, geometry not clipped
if not reproj.shp.intersects(dst_geom.shp):
    continue
reprojected_features.append(Feature(reproj, feat.properties))

Whether this matters depends on callers. If downstream code only needs to find which features overlap, it's fine. If code assumes returned geometries are bounded (e.g. for area computations, rendering, or spatial joins), it can silently produce wrong results. Recommend either adding a reproj.shp.intersection(dst_geom.shp) clip here, or documenting the relaxed contract. Fix this →

2. hf_srtm.py: possible duplicate items when MultiPolygon straddles antimeridian

geometry.py:265-276 now iterates over each component of the flattened MultiPolygon to build the tile list, but there's no deduplication:

for shp in flatten_shape(wgs84_geometry.shp):
    ...
    for lon_min in range(...):
        for lat_min in range(...):
            items.append(self._tile_to_item[(lon_min, lat_min)])  # no seen-set guard

For a window that is a MultiPolygon (e.g. from antimeridian splitting), if multiple parts cover overlapping tile coordinates, the same SRTMItem can be appended more than once. This likely doesn't cause incorrect output (since match_candidate_items_to_window tolerates duplicates) but wastes work. Compare with gcp_public_data.py and aws_open_data.py which correctly use a seen_names set / set() for deduplication. Fix this →


Minor Issues

3. to_wgs84() allocates a new CRS on every call

geometry.py:303:

if self.projection.crs == CRS.from_epsg(WGS84_EPSG):

CRS.from_epsg() constructs a new object each time. Since WGS84_PROJECTION is a module-level constant, this should use WGS84_PROJECTION.crs. The same pattern exists in the old intersects(), but to_wgs84() is called on every non-same-projection intersects() call now (twice per call), so the overhead is doubled. Fix this →

4. intersects() now does two WGS84 reprojections instead of one

Old: reproject other → self.projection (1 reprojection).
New: reproject self → WGS84 and other → WGS84 (2 reprojections).

This is the right tradeoff for antimeridian correctness, but worth noting if intersects() is called at scale (e.g. in tight filtering loops). No action required unless profiling reveals a problem, but it's a useful thing to keep in mind.

5. safely_reproject_within_valid_area: clipped_shp.area == 0 check is fragile for lines/points

geometry.py:646:

if clipped_shp.is_empty or clipped_shp.area == 0:
    results.append(None)
    continue

For line or point source geometries (area = 0 by definition), this will always append None even if they genuinely intersect the valid area. The prior safely_reproject_and_clip used shp_intersects() instead, which handles zero-area geometries. This was an implicit regression for vector data sources dealing with lines/points. Consider using clipped_shp.is_empty alone, or shp_intersects(clipped_shp, get_clip_region()) for the non-empty check. Fix this →


Positive Notes

  • The make_valid guard in to_projection with a clear comment is a clean, targeted fix for the near-duplicate-vertex topology issues.
  • The new to_wgs84() method cleanly separates "reproject to WGS84" from "convert to WGS84 with antimeridian splitting", and using it uniformly across data sources is a good simplification.
  • get_clip_region() correctly buffers each component of a MultiPolygon individually, which is the right approach for antimeridian-split geometries.
  • The CDSE real-world fixture in the test is excellent — it captures a genuine production failure.
  • Removing is_too_large() and the MAX_GEOMETRY_DEGREES heuristic is the right call; the new approach is more principled and handles the same cases more correctly.
  • All affected data sources (aws_landsat, aws_open_data, gcp_public_data, openstreetmap, hf_srtm) correctly adopted to_wgs84() + flatten_shape() for antimeridian-safe bounding-box queries.

@favyen2 favyen2 requested a review from APatrickJ March 3, 2026 18:28
Comment on lines -303 to -313
def is_too_large(self) -> bool:
"""Returns whether this geometry's spatial coverage is too large.
def to_wgs84(self) -> "STGeometry":
"""Convert to WGS84 with antimeridian splitting handling.

This means that it will likely have issues during re-projections and such.
For geometries already in WGS84, this is a no-op.
"""
wgs84_bounds = self.to_projection(WGS84_PROJECTION).shp.bounds
if wgs84_bounds[2] - wgs84_bounds[0] > MAX_GEOMETRY_DEGREES:
return True
if wgs84_bounds[3] - wgs84_bounds[1] > MAX_GEOMETRY_DEGREES:
return True
return False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do use this function in Studio, though I suppose we could easily write our own version that does the same thing (and ideally we want to remove the rslearn dependency from Studio entirely anyway). But for now, maybe mark this as deprecated?

Unlike direct to_projection(), this clips each source geometry in WGS84 to a
buffered area around valid_geom before reprojecting. This minimizes distortions in
case valid_geom is small but src_geoms may be large. It works best if src_geoms are
also natively in WGS84; otherwise, there could be distortion issues re-projecting
Copy link
Collaborator Author

@favyen2 favyen2 Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it could be more clear in the overall design that this is only addressing problems when the item geometries (src_geoms here) are in WGS84. If there are big item geometries in other projections, they would still be distorted when we re-project them to WGS84. Ideally we would re-project valid_geom to the geometry of the src_geoms, but the problem is we don't know how to deal with antimeridian splitting and stuff like that in every single CRS possible.

b = part.bounds
boxes.append(
shapely.box(
b[0] - WGS84_CLIP_BUFFER_DEGREES,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add max(width, height) here then we would better deal with large window geometries.

Copy link
Collaborator Author

@favyen2 favyen2 Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code hinges on windows being small relative to items but you could still have like 3x3 windows that are much smaller than 10x10 items or something like that.

Comment on lines +359 to +363
if reproj is None:
continue
reprojected_features.append(Feature(geom, feat.properties))
# Only include shapes that intersect the given bounds.
if not reproj.shp.intersects(dst_geom.shp):
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth clipping to dst_geom.shp because the reprojected geometry clips to a +/- 1 degree buffer.

            clipped_shp = reproj.shp.intersection(dst_geom.shp)
            clipped_geom = STGeometry(reproj.projection, clipped_shp, reproj.time_range)

Then use clipped_geom instead of reproj below

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants