Added AlphaEarth datasource #562

Closed
robmarkcole wants to merge 3 commits into allenai:master from robmarkcole:add-alphaearth

Collaborator

robmarkcole commented Mar 16, 2026

  • Added a new AlphaEarth datasource backed by the public Source Cooperative STAC GeoParquet index, with direct materialization support, optional dequantization, and minimal new dependencies.
  • Documented and supported use of use_all_bands_in_order_of_band_set_idx so AlphaEarth embeddings can be consumed in model configs without repeating band names.
  • Added datasource-aware default nodata handling during materialization. AlphaEarth now defaults to -2.0 in dequantized mode and -128 in raw mode when nodata_vals is not specified.
  • Updated AlphaEarth docs to recommend ingest: false because annual TIFFs are large, and clarified dequantized embedding semantics: embeddings are only approximately unit-norm after quantization, so optional L2 re-normalization may be useful for cosine-based workflows.
  • Added unit and integration tests covering generated band names, model-input band resolution, datasource default nodata behavior, and AlphaEarth item lookup, ingest, and raster reads.
  • Extended raster band-set configuration so consecutive band names can be generated from num_bands with a custom prefix, start index, and zero padding, allowing AlphaEarth’s 64 bands to be configured without listing them explicitly.
// Before
{
  "type": "raster",
  "band_sets": [{
    "dtype": "float32",
    "bands": [
      "A00", "A01", "A02", "A03", "A04", "A05", "A06", "A07",
      "A08", "A09", "A10", "A11", "A12", "A13", "A14", "A15",
      "A16", "A17", "A18", "A19", "A20", "A21", "A22", "A23",
      "A24", "A25", "A26", "A27", "A28", "A29", "A30", "A31",
      "A32", "A33", "A34", "A35", "A36", "A37", "A38", "A39",
      "A40", "A41", "A42", "A43", "A44", "A45", "A46", "A47",
      "A48", "A49", "A50", "A51", "A52", "A53", "A54", "A55",
      "A56", "A57", "A58", "A59", "A60", "A61", "A62", "A63"
    ]
  }]
}

// After
{
  "type": "raster",
  "band_sets": [{
    "dtype": "float32",
    "num_bands": 64,
    "band_prefix": "A",
    "band_zero_pad": 2
  }]
}
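
To illustrate how the num_bands, band_prefix, and band_zero_pad options above could expand into the 64 explicit names from the "Before" config, here is a minimal Python sketch. The function name generate_band_names and its start parameter are hypothetical and are not taken from the rslearn code in this PR; only the resulting names (A00 through A63) come from the example above.

```python
def generate_band_names(
    num_bands: int, prefix: str, start: int = 0, zero_pad: int = 0
) -> list[str]:
    """Build consecutive band names such as A00, A01, ..., A63.

    Hypothetical helper for illustration only; the real rslearn
    implementation may differ in naming and validation.
    """
    if num_bands <= 0:
        raise ValueError("num_bands must be positive")
    # str.zfill left-pads the index with zeros up to zero_pad characters;
    # a zero_pad of 0 leaves the index unchanged.
    return [
        f"{prefix}{str(index).zfill(zero_pad)}"
        for index in range(start, start + num_bands)
    ]


# Mirrors the "After" config: 64 bands, prefix "A", two-digit padding.
alphaearth_bands = generate_band_names(64, prefix="A", zero_pad=2)
```

With these arguments the list starts at A00 and ends at A63, matching the explicit list in the "Before" config.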

Example config

        "alphaearth": {
            "type": "raster",
            "band_sets": [
                {
                    "dtype": "float32",
                    "num_bands": 64,
                    "band_prefix": "A",
                    "band_zero_pad": 2
                }
            ],
            "data_source": {
                "class_path": "rslearn.data_sources.alphaearth.AlphaEarth",
                "init_args": {
                    "metadata_cache_dir": "cache/alphaearth",
                    "index_url": "https://data.source.coop/tge-labs/aef/v1/annual/aef_index_stac_geoparquet.parquet",
                    "apply_dequantization": true
                },
                "ingest": false
            }
        }
    },
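
Since the config above enables apply_dequantization, the optional L2 re-normalization mentioned in the description could look roughly like the sketch below. It assumes a materialized array of shape (bands, height, width) and treats a pixel as nodata when any band equals the nodata value; both are assumptions made for illustration. The helper names default_nodata and l2_renormalize are hypothetical and are not part of rslearn. Only the default nodata values (-2.0 dequantized, -128 raw) come from the PR description.

```python
import numpy as np

# Default nodata values as stated in the PR description.
DEQUANTIZED_NODATA = -2.0
RAW_NODATA = -128


def default_nodata(apply_dequantization: bool) -> float:
    """Pick the default nodata value for the given mode (hypothetical helper)."""
    return DEQUANTIZED_NODATA if apply_dequantization else RAW_NODATA


def l2_renormalize(
    embeddings: np.ndarray, nodata: float = DEQUANTIZED_NODATA, eps: float = 1e-12
) -> np.ndarray:
    """Rescale each pixel's embedding vector to unit L2 norm.

    Assumes embeddings has shape (bands, height, width). Pixels where any
    band equals nodata are left filled with nodata (an assumption here).
    """
    embeddings = embeddings.astype(np.float32)
    invalid = np.any(embeddings == nodata, axis=0)
    # Per-pixel norm across the band axis; eps guards against division by zero.
    norms = np.linalg.norm(embeddings, axis=0, keepdims=True)
    normalized = embeddings / np.maximum(norms, eps)
    # Restore nodata for invalid pixels so downstream masking still works.
    normalized = np.where(invalid[None, :, :], nodata, normalized)
    return normalized.astype(np.float32)
```

After this step, the dot product of two valid pixel vectors equals their cosine similarity, which is the main reason to re-normalize for cosine-based workflows.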

Materialised raster

[image]

Collaborator

favyen2 commented Mar 22, 2026

The existing data source https://github.com/allenai/rslearn/blob/master/rslearn/data_sources/aws_google_satellite_embedding_v1.py also pulls from the Source Cooperative data, but reads it from their AWS S3 bucket. Is there a substantial difference between that one and this one?

For the bands, since we already have num_bands, I think it would make sense to align the band names with the ones automatically generated by num_bands for these embedding datasets instead of adding the band_prefix/band_zero_pad options. The latter would add more things that users need to figure out how to configure in order to use the data source.

@robmarkcole
Collaborator Author

I wasn't aware of the existing data source; I will switch to using that (although I have no immediate plans to use it again). Thanks.
