
NumpyRasterFormat speed-up#564

Open
Farbum wants to merge 4 commits into master from
hadriens/numpyraster

Conversation


@Farbum Farbum commented Mar 16, 2026

TLDR
Add NumpyRasterFormat and spatial_size config option to efficiently materialize and load coarse-resolution temporal layers (e.g. ERA5) as compact NumPy arrays instead of large multi-band GeoTIFFs.

Motivation
When materializing coarse-resolution temporal data (e.g. ERA5 at 0.1° with 14 variables × 365 days), two problems arise:
Expensive GeoTIFF reads: Storing 5,000+ bands in a single GeoTIFF makes loading extremely slow due to format overhead (headers, compression, tiling metadata per band).
zoom_offset limitations: zoom_offset only supports power-of-2 scaling, so it cannot reduce a 128×128 window to a single pixel for data that has only one spatial point per window. The same value is therefore redundantly replicated across every pixel in the window, and dataset sizes explode.

Changes
This PR introduces two new features to address both problems:
NumpyRasterFormat: A new RasterFormat that stores raster data as raw .npy files with a metadata.json sidecar. This avoids all GeoTIFF overhead, making reads ~500× faster for arrays with thousands of bands (e.g. 14 channels × 365 timesteps = 5,110 bands). Supports full (C, T, H, W) round-tripping including timestamps.
spatial_size on BandSetConfig: A new config option (e.g. spatial_size: [1, 1]) that controls the output spatial dimensions during materialization. The projection resolution is adjusted so the window's geographic extent maps to exactly the requested pixel count, allowing compact (C, T, 1, 1) output for coarse-resolution layers. Mutually exclusive with zoom_offset.
Together, these enable an ERA5 layer to be materialized and loaded as a (14, 365, 1, 1) NumPy array instead of a 5,110-band GeoTIFF at 128×128 — orders of magnitude faster and smaller.
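To make the sidecar layout concrete, here is a minimal standalone sketch of the .npy + metadata.json round-trip described above. The function names, paths, and metadata fields are illustrative assumptions, not the actual NumpyRasterFormat API or NumpyRasterMetadata schema:

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def encode(path: Path, array: np.ndarray, timestamps: list[str]) -> None:
    # Raw array goes in data.npy; shape/dtype bookkeeping in metadata.json.
    np.save(path / "data.npy", array)
    metadata = {
        "num_channels": array.shape[0],
        "num_timesteps": array.shape[1],
        "dtype": array.dtype.name,
        "timestamps": timestamps,
    }
    (path / "metadata.json").write_text(json.dumps(metadata))

def decode(path: Path) -> tuple[np.ndarray, dict]:
    array = np.load(path / "data.npy")
    metadata = json.loads((path / "metadata.json").read_text())
    return array, metadata

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # (C, T, H, W) = (14, 365, 1, 1): one ERA5 value per window per day.
    src = np.random.rand(14, 365, 1, 1).astype(np.float32)
    encode(root, src, timestamps=["2025-01-01"])
    out, meta = decode(root)
```

Because .npy stores the array buffer directly, reading it back is a single np.load with no per-band header or tile parsing, which is where the claimed speed-up comes from.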


Farbum commented Mar 16, 2026

@claude can you review this PR


claude bot commented Mar 16, 2026

Claude finished @Farbum's task in 2m 15s


PR Review: NumpyRasterFormat speed-up


Overall, this is a well-scoped PR with good test coverage and clear motivation. The core approach (.npy + metadata.json sidecar, plus spatial_size config) is sound. A few issues worth addressing before merging:


Bugs

1. Shape validation in decode_raster is misleading and nearly a no-op

rslearn/utils/raster_format.py lines 871–882:

expected_shape = (
    metadata.num_channels,
    metadata.num_timesteps,
    array.shape[2],   # ← pulled from the array itself!
    array.shape[3],   # ← same
)
if array.shape != expected_shape:
    ...

The spatial dimensions (H, W) of expected_shape are taken from array itself, so the check array.shape != expected_shape only ever tests whether array.shape[0] == metadata.num_channels and array.shape[1] == metadata.num_timesteps. Corrupt spatial metadata will never be caught. The fix would be to store H and W in the metadata and validate against those:

expected_shape = (
    metadata.num_channels,
    metadata.num_timesteps,
    metadata.height,     # store these in NumpyRasterMetadata
    metadata.width,
)
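To see why the original check is nearly a no-op, here is a small standalone repro in plain NumPy, independent of the rslearn code:

```python
import numpy as np

num_channels, num_timesteps = 14, 365

# Array with "corrupt" spatial dims: 2x2 instead of the intended 1x1.
array = np.zeros((num_channels, num_timesteps, 2, 2))

# Mirrors the original validation: H and W are read back out of the
# array itself, so any spatial size passes the comparison.
expected_shape = (
    num_channels,
    num_timesteps,
    array.shape[2],
    array.shape[3],
)
passes = array.shape == expected_shape  # True despite wrong spatial dims
```

Only the channel and timestep dimensions are ever actually compared.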



2. No validation that spatial_size values are positive

rslearn/config/dataset.py lines 210–218:

if self.spatial_size is not None and self.zoom_offset != 0:
    raise ValueError(...)
if self.spatial_size is not None and len(self.spatial_size) != 2:
    raise ValueError(...)

A spatial_size=[0, 1] or spatial_size=[-1, 1] would pass validation but cause a ZeroDivisionError or nonsensical output in get_final_projection_and_bounds. Should add:

if self.spatial_size is not None and any(v <= 0 for v in self.spatial_size):
    raise ValueError("spatial_size values must be positive integers")



Design / correctness concerns

3. decode_raster silently ignores projection and bounds

The RasterFormat contract is that decode_raster(path, projection, bounds) returns raster data for the requested area. NumpyRasterFormat breaks this contract by always returning the stored data regardless of what projection and bounds are passed. This is documented, but it means a caller passing the wrong bounds (e.g. due to a spatial_size mismatch) will get silently incorrect data rather than an error or reprojection.

At minimum, consider adding an assertion that the stored bounds match the requested ones, or logging a warning:

import warnings

if metadata.bounds != bounds:
    # Stored at different bounds than requested; the caller should be aware.
    warnings.warn(
        f"stored bounds {metadata.bounds} do not match requested "
        f"bounds {bounds}; returning stored data as-is"
    )

4. new_bounds origin arithmetic can shift geographic position

rslearn/config/dataset.py lines 254–259:

new_bounds = (
    round(bounds[0] * x_factor),
    round(bounds[1] * y_factor),
    round(bounds[0] * x_factor) + target_w,
    round(bounds[1] * y_factor) + target_h,
)

Multiplying the pixel min-coordinate by the scale factor performs rounding that can shift the geographic origin. For example, bounds (100, 200, 228, 328) at 10m/px → scaled with x_factor=1/128 gives round(0.78)=1, placing the output pixel at geographic coordinate 1×1280=1280m instead of the actual origin at 100×10=1000m (a 280m shift). This is likely acceptable for ERA5-scale data where spatial variation within a window is negligible, but the behaviour should be noted in the docstring.
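The arithmetic in that example can be reproduced directly (values taken from the example above; variable names are illustrative):

```python
# Worked example of the rounding shift in new_bounds.
bounds = (100, 200, 228, 328)  # pixel coords at 10 m/px, 128x128 window
resolution = 10.0              # metres per source pixel
x_factor = 1 / 128             # collapse 128 pixels down to 1

true_origin_m = bounds[0] * resolution                    # 100 * 10 = 1000 m
scaled_min = round(bounds[0] * x_factor)                  # round(0.78125) = 1
output_resolution = resolution / x_factor                 # 1280 m per pixel
rounded_origin_m = scaled_min * output_resolution         # 1 * 1280 = 1280 m

shift_m = rounded_origin_m - true_origin_m                # 280 m shift
```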


Minor

5. NumpyRasterMetadata.dtype is stored but unused in decode

The dtype field is written to metadata.json but decode_raster does not validate it against array.dtype. Since .npy files carry dtype natively this is fine functionally, but the field gives a false impression of validation. Either use it (e.g., assert array.dtype.name == metadata.dtype) or remove it from the metadata.
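If the field is kept, the "use it" option could look like this standalone sketch (the function and metadata value here are hypothetical, not the actual NumpyRasterMetadata model):

```python
import numpy as np

def check_dtype(array: np.ndarray, stored_dtype: str) -> None:
    # Validate the array's dtype against the value recorded in the sidecar.
    if array.dtype.name != stored_dtype:
        raise ValueError(
            f"metadata dtype {stored_dtype!r} does not match "
            f"array dtype {array.dtype.name!r}"
        )

check_dtype(np.zeros(4, dtype=np.float32), "float32")  # passes silently
```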


Positive notes

  • The train/dataset.py change (using src.shape[2/3] instead of final_bounds dimensions) is the right fix and makes the tensor allocation format-agnostic.
  • Mutual exclusivity of spatial_size and zoom_offset is correctly enforced.
  • Test coverage is good: round-trips for single/multi-timestep, dtype preservation, file existence, and config validation are all covered.
  • NumpyRasterMetadata as a Pydantic model for the sidecar is a clean approach.

@Farbum Farbum marked this pull request as ready for review March 17, 2026 16:32

Farbum commented Mar 17, 2026

Addressed Claude's comments
