Conversation
|
@claude can you review this PR |
|
Claude finished @Farbum's task in 2m 15s —— View job PR Review: NumpyRasterFormat speed-up
Overall, this is a well-scoped PR with good test coverage and clear motivation. The core approach ( Bugs1. Shape validation in
expected_shape = (
metadata.num_channels,
metadata.num_timesteps,
array.shape[2], # ← pulled from the array itself!
array.shape[3], # ← same
)
if array.shape != expected_shape:
...The spatial dimensions ( expected_shape = (
metadata.num_channels,
metadata.num_timesteps,
metadata.height, # store these in NumpyRasterMetadata
metadata.width,
)2. No validation that
if self.spatial_size is not None and self.zoom_offset != 0:
raise ValueError(...)
if self.spatial_size is not None and len(self.spatial_size) != 2:
raise ValueError(...)A if self.spatial_size is not None and any(v <= 0 for v in self.spatial_size):
raise ValueError("spatial_size values must be positive integers")Design / correctness concerns3. The At minimum, consider adding an assertion that the stored bounds match the requested ones, or logging a warning: if metadata.bounds != bounds:
# stored at different bounds than requested — caller should be aware
warnings.warn(...)4.
new_bounds = (
round(bounds[0] * x_factor),
round(bounds[1] * y_factor),
round(bounds[0] * x_factor) + target_w,
round(bounds[1] * y_factor) + target_h,
)Multiplying the pixel min-coordinate by the scale factor performs rounding that can shift the geographic origin. For example, bounds Minor5. The Positive notes
|
|
Addressed Claude comments |
TLDR
Add NumpyRasterFormat and spatial_size config option to efficiently materialize and load coarse-resolution temporal layers (e.g. ERA5) as compact NumPy arrays instead of large multi-band GeoTIFFs.
Motivation
When materializing coarse-resolution temporal data (e.g. ERA5 at 0.1° with 14 variables × 365 days), two problems arise:
Expensive GeoTIFF reads: Storing 5,000+ bands in a single GeoTIFF makes loading extremely slow due to format overhead (headers, compression, tiling metadata per band).
zoom_offset limitations: zoom_offset only supports power-of-2 scaling, so it cannot reduce a 128×128 window to a single pixel for data that has only one spatial point per window. This leads to redundant spatial replication of the same value across all pixels -> exploding dataset sizes.
Changes
This PR introduces two new features to address both problems:
NumpyRasterFormat: A new RasterFormat that stores raster data as raw .npy files with a metadata.json sidecar. This avoids all GeoTIFF overhead, making reads ~500× faster for arrays with thousands of bands (e.g. 14 channels × 365 timesteps = 5,110 bands). Supports full (C, T, H, W) round-tripping including timestamps.
spatial_size on BandSetConfig: A new config option (e.g. spatial_size: [1, 1]) that controls the output spatial dimensions during materialization. The projection resolution is adjusted so the window's geographic extent maps to exactly the requested pixel count, allowing compact (C, T, 1, 1) output for coarse-resolution layers. Mutually exclusive with zoom_offset.
Together, these enable an ERA5 layer to be materialized and loaded as a (14, 365, 1, 1) NumPy array instead of a 5,110-band GeoTIFF at 128×128 — orders of magnitude faster and smaller.