Cache GeoTIFF metadata to skip remote reads on rebuild#19

Merged
NewGraphEnvironment merged 1 commit into main from 10-cache-geotiff-metadata on Feb 19, 2026
@NewGraphEnvironment
Owner

Summary

  • Replace the subprocess call to rio cogeo validate with a rasterio-based geotiff_extract_metadata() that extracts spatial metadata (CRS, bounds, shape, transform) and validates COG status in one remote read
  • Add item_create_from_cache() to build pystac Items from cached metadata with zero network I/O
  • Extend stac_geotiff_checks.csv with spatial columns (epsg, height, width, transform, bounds), backward compatible with old-format rows
  • Update item_create.py and item_reprocess.py to use the cache-hit path and fall back to rio_stac on a miss
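
The cache-hit versus cache-miss decision described in the list above can be sketched roughly as follows. This is not the PR's item_create_from_cache(): it uses plain dicts rather than pystac, the helper name proj_fields_from_cache is hypothetical, and it assumes the transform and bounds columns are stored as JSON-encoded lists, which the PR does not state.

```python
import json


def proj_fields_from_cache(row):
    """Return STAC projection fields from a cached CSV row, or None on a cache miss."""
    required = ("height", "width", "transform", "bounds")
    # Old-format rows lack the spatial columns (or leave them empty): treat as a miss
    # so the caller can fall back to a remote read.
    if any(not row.get(col) for col in required):
        return None
    return {
        # EPSG is optional here: a file can have a valid CRS with no EPSG code.
        "proj:epsg": int(row["epsg"]) if row.get("epsg") else None,
        "proj:shape": [int(row["height"]), int(row["width"])],
        "proj:transform": json.loads(row["transform"]),
        "proj:bbox": json.loads(row["bounds"]),
    }


# Hypothetical rows, as csv.DictReader would yield them (all values are strings).
new_row = {
    "url": "https://example.com/dem.tif",
    "is_geotiff": "True",
    "epsg": "3005",
    "height": "100",
    "width": "200",
    "transform": "[1.0, 0.0, 1000.0, 0.0, -1.0, 2000.0]",
    "bounds": "[1000.0, 1900.0, 1200.0, 2000.0]",
}
old_row = {"url": "https://example.com/old.tif", "is_geotiff": "True"}

hit = proj_fields_from_cache(new_row)
miss = proj_fields_from_cache(old_row)
```

Returning None for old-format rows keeps the fallback decision in the caller, which is one way to stay backward compatible without migrating the CSV up front.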

Performance

  • Cache warm: 10 items created at 65k items/sec (vs ~3 items/sec with remote reads)
  • One-time baseline extraction needed to populate spatial columns for existing 60k URLs
  • After baseline: full rebuilds drop from ~5.5 hours to minutes
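
As a rough consistency check on these figures, the quoted rates can be applied to the 60k URLs. The sketch below only does the arithmetic; the 65k items/sec rate was measured on 10 items, so the warm-cache result is a lower bound on item construction alone and ignores serialization and disk I/O.

```python
URLS = 60_000
REMOTE_RATE = 3       # items/sec with remote reads (from the PR description)
CACHED_RATE = 65_000  # items/sec with a warm cache (measured on 10 items)

remote_hours = URLS / REMOTE_RATE / 3600
cached_seconds = URLS / CACHED_RATE

print(f"remote reads: {remote_hours:.1f} h")  # remote reads: 5.6 h
print(f"warm cache: {cached_seconds:.2f} s")  # warm cache: 0.92 s
```

The remote-read estimate agrees with the ~5.5 hour rebuild quoted above.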

Test plan

  • 3-item test: metadata extracted, cached, items created from cache
  • 10-item test: mix of cache hits and misses, all items created correctly
  • Re-run with warm cache: zero remote reads, zero GDAL warnings
  • Output JSON matches existing prod items (proj extension fields, bbox, geometry)
  • Full baseline extraction on VM (one-time ~5.5 hr run to populate cache)

Relates to NewGraphEnvironment/sred-2025-2026#8

🤖 Generated with Claude Code

Replace subprocess-based check_geotiff_cog() with geotiff_extract_metadata()
that extracts CRS, bounds, shape, and transform via rasterio in one read.
Add item_create_from_cache() to build pystac Items from cached metadata
without any network I/O. Full rebuild drops from ~5.5 hours to minutes
once cache is populated.

Closes #10
Relates to NewGraphEnvironment/sred#8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude Bot commented Feb 19, 2026

Review

Bug: src.crs may be None, and the resulting AttributeError is swallowed silently

In geotiff_extract_metadata (stac_utils.py):

epsg = src.crs.to_epsg()  # AttributeError if src.crs is None

GeoTIFFs without a defined CRS will crash here, get caught by except Exception, and be marked is_geotiff: False. That's wrong — the file is readable, it just lacks a CRS. Fix:

epsg = src.crs.to_epsg() if src.crs else None
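
A small sketch of how that guard behaves, pulled into a helper so it can be exercised without opening a file. The helper name epsg_or_none and the FakeCRS class are hypothetical stand-ins; only a to_epsg() method is modelled.

```python
def epsg_or_none(crs):
    """Return the EPSG code for a CRS-like object, or None if no CRS or no EPSG match."""
    return crs.to_epsg() if crs else None


class FakeCRS:
    """Hypothetical stand-in for a CRS object; only to_epsg() is modelled."""

    def __init__(self, epsg):
        self._epsg = epsg

    def to_epsg(self):
        return self._epsg


no_crs = epsg_or_none(None)               # file readable, but no CRS defined
bc_albers = epsg_or_none(FakeCRS(3005))   # CRS with an EPSG code
custom = epsg_or_none(FakeCRS(None))      # valid CRS with no EPSG equivalent
```

Both the missing-CRS and the non-EPSG cases yield None without raising, so the file can still be recorded as a readable GeoTIFF.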

Bug: Non-EPSG CRS files loop forever through re-extraction

Any file with a valid CRS that can't be expressed as an EPSG code returns is_geotiff: True, epsg: None. In load_validation_cache, these URLs land in needs_upgrade every run because the elif row.get("is_geotiff") branch fires again. They never graduate to has_spatial, so every run triggers a remote read for them. Uncommon in BC DEM data but worth handling.
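
One possible way to break that loop is to decide "has spatial metadata" from the shape columns rather than from epsg, since height and width are populated for any readable raster. The sketch below is an assumption about how the cache could be bucketed, not the logic in load_validation_cache; classify_row is a hypothetical helper and the rows are illustrative.

```python
def classify_row(row):
    """Bucket a cached row as 'has_spatial', 'needs_upgrade', or 'skip'."""
    if str(row.get("is_geotiff")).lower() != "true":
        return "skip"
    # Key on shape, not EPSG: a file with a non-EPSG CRS still has height and width,
    # so it graduates to has_spatial after a single extraction.
    if row.get("height") and row.get("width"):
        return "has_spatial"
    # Old-format row: validated as a GeoTIFF but never spatially extracted.
    return "needs_upgrade"


rows = [
    {"url": "a.tif", "is_geotiff": "True", "epsg": "3005", "height": "100", "width": "200"},
    {"url": "b.tif", "is_geotiff": "True", "epsg": "", "height": "100", "width": "200"},
    {"url": "c.tif", "is_geotiff": "True"},
    {"url": "d.txt", "is_geotiff": "False"},
]
buckets = [classify_row(r) for r in rows]
```

An explicit "spatial extraction attempted" column would work as well; the point is that the upgrade test should not depend on a field that can legitimately stay empty.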

Performance: Two remote file opens per URL, not one

The PR summary says "one remote read" but cog_validate() opens the file independently after rasterio. Not a correctness issue, but the baseline extraction will be slower than claimed.

Observability: Silent exception swallowing

except Exception:
    return {"url": url, "is_geotiff": False, ...}

No log of what failed or why. Add at minimum:

except Exception as e:
    logger.warning("Failed to read %s: %s", url, e)

Otherwise debugging failures in a 60k-URL run is very hard.
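
A minimal sketch of that pattern, going one step further by also keeping the failure reason in the returned record. The reader callable stands in for the real extraction, and the "error" field is an assumption, not a column in the PR's CSV.

```python
import logging

logger = logging.getLogger(__name__)


def extract_or_flag(url, reader):
    """Run reader(url); on failure, log the cause and return a flagged record."""
    try:
        return {"url": url, "is_geotiff": True, **reader(url)}
    except Exception as e:
        # Log the URL and the cause so failures in a large run can be traced.
        logger.warning("Failed to read %s: %s", url, e)
        return {"url": url, "is_geotiff": False, "error": f"{type(e).__name__}: {e}"}


def failing_reader(url):
    """Hypothetical reader that simulates a network failure."""
    raise OSError("timeout")


def working_reader(url):
    """Hypothetical reader that returns fixed metadata."""
    return {"height": 100, "width": 200}


failed = extract_or_flag("https://example.com/a.tif", failing_reader)
ok = extract_or_flag("https://example.com/b.tif", working_reader)
```

Persisting the error string alongside the row makes it possible to separate transient network failures from files that are genuinely not GeoTIFFs after the run.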


The cache-hit path and upgrade detection logic are otherwise correct. The performance gain for warm-cache rebuilds is real.

@NewGraphEnvironment NewGraphEnvironment merged commit 6cb37a4 into main Feb 19, 2026
2 checks passed
@NewGraphEnvironment NewGraphEnvironment deleted the 10-cache-geotiff-metadata branch February 19, 2026 02:10