RTS Segmentation Model v2: Data Specification

Project Context

Objective: Semantic segmentation of Retrogressive Thaw Slumps (RTS) in Arctic satellite imagery for pan-arctic mapping (60-74°N).

Data versioning: Use semantic versioning (major.minor)

Major: Added new training data or significant changes to processing
Minor: No new training data but changes in existing labels

1. Data Sources

1.1 Primary Training Data: PlanetScope Basemap

Attribute	Value
Product	Global Quarterly PlanetScope Basemap
Temporal window	July–September (growing season)
Training year	2024 composites
Inference year	2025 composites
Bands	RGB (3 channels)
Resolution	1.3–3.0 m (varies by latitude; ~3 m in study region)
Effective GSD	~10 m (due to mosaic processing)
Coverage	Below 74°N only
Notes	Proprietary color-correction optimized for CV analytics

1.2 Auxiliary Data Sources

Source	Resolution	Channels/Derivatives	Purpose
Sentinel-2	10 m	NDVI, NIR	Differentiate vegetation from background
ArcticDEM	2 m	Relative Elevation (RE), Shaded Relief (SR)	Terrain context

1.3 Secondary Training Data (Optional)

Source	Resolution	Volume	Use Case
Maxar (Yang et al. 2023)	4 m	900+ images	Cross-sensor generalisation experiments

Note: Sentinel-2, Maxar, and other sensors exhibit domain shift from PlanetScope. Cross-sensor models require separate experimentation.

2. Label Source and Refinement

2.1 Source Dataset

ARTS (Arctic RTS compilation dataset) provides initial polygon locations
Polygons were manually refined on 2024 PlanetScope imagery

Category	Count	Notes
Positive tiles	TBD	From ARTS polygon refinement
Negative tiles	TBD	From ARTS confirmed negatives + hard negatives

2.2 Labeling Criteria

Include in label (AND):

Visible headwall with cast shadow (visually concave landscape)
Barren slump floor (indicates active RTS)
Clear morphological distinction from surrounding terrain

Exclude from label (OR):

Features too small to show clear diagnostic characteristics at PlanetScope resolution
Ambiguous features lacking obvious headwall shadow
Inactive/stabilized RTS without barren floor
Long debris tongue or mudflow that is distant from the slump floor (no thawing)

2.3 Consistency

HOW TO DEFINE BOUNDARY: Enforcing perfect consistency would be exhausting and arguably impossible for gradational geomorphic boundaries. The label boundary should follow the visible colour/texture contrast between disturbed and undisturbed ground; where no clear contrast exists, connect the endpoints of the headwall arc with a natural curve.

PATTERN: Make sure the overall morphology reads as a collapsed (concave) landscape.

2.4 Partial Target Handling

This is critical for training data quality:

Scenario	Action
Complete RTS fully within tile	Label as RTS
Partial RTS with both headwall and floor visible	Label as RTS
Partial RTS with only floor visible (no headwall in tile)	Ignore index: 255
Partial RTS with only headwall visible (no floor in tile)	Ignore index: 255

Rationale: The model learns that "only barren floor associated with a headwall with shadow is RTS." Overlapping inference tiles ensure partial targets are detected where both features are visible. Use the ignore index (255) for pixels that are part of an RTS but lack one of the two diagnostic features (headwall or floor) in that specific tile. This prevents the model from learning conflicting information while maintaining strict detection criteria. It encodes domain knowledge directly in the labels, which is especially important when training data is limited.

How to decide whether a partial object should be trained on: view the partial object in isolation (mask the adjacent tiles while labelling). If it can be confirmed as RTS without context from neighbouring tiles, label it as RTS; otherwise assign the ignore index.

2.5 Label Values

Value	Meaning
0	Background (no RTS)
1	RTS (positive class)
255	Ignore

The ignore value can be applied in several situations, for example:

Partial objects that cannot be confirmed without neighbouring-tile context, even if they are obvious with that context
RTS-like features that cannot be confirmed at PlanetScope image quality/resolution, even if they are clear in the Esri basemap
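As an illustration of how the ignore value is typically consumed downstream, the sketch below computes a binary cross-entropy that skips every pixel labelled 255. It is a minimal NumPy example, not the project's training loss; the function name `masked_bce` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

IGNORE_INDEX = 255  # label value for pixels excluded from training


def masked_bce(prob_rts: np.ndarray, label: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over pixels whose label is 0 or 1 (hypothetical helper)."""
    valid = label != IGNORE_INDEX
    if not valid.any():
        return 0.0  # a fully ignored tile contributes nothing
    p = np.clip(prob_rts[valid], eps, 1.0 - eps)
    y = label[valid].astype(np.float64)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())


# Toy 2x2 tile: top row is labelled, bottom row is ignored.
label = np.array([[0, 1], [255, 255]], dtype=np.uint8)
prob = np.array([[0.1, 0.9], [0.5, 0.99]])
loss = masked_bce(prob, label)

# Changing predictions on ignored pixels leaves the loss unchanged.
prob_alt = prob.copy()
prob_alt[1, :] = 0.01
loss_alt = masked_bce(prob_alt, label)
```

Because ignored pixels never enter the average, a partial RTS without its headwall neither rewards nor penalises the model, which is the behaviour the rationale above asks for.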

3. Training Image Specification

3.1 Tile Configuration

Parameter	Value
Tile size	512 × 512 pixels
Spatial coverage	~1.5 km × 1.5 km (at 3 m resolution)
CRS	EPSG:3857 (Pseudo-Mercator -- Spherical Mercator, Google Maps, OpenStreetMap, Bing, ArcGIS, ESRI)
Format	GeoTIFF
Grid alignment	Planet tile grid (same grid used for polygon refinement)

3.2 File Naming Convention

{tile_id}.tif

3.3 File Structure

Data lives in the GCS bucket, mounted via gcsfuse at training time. All paths are configured in the YAML config — no hardcoded paths in code:

data:
  root: "gs://abruptthawmapping/training/v2.0"  # configure per environment
  rgb_dir: "PLANET-RGB"
  extra_dir: "EXTRA"
  labels_dir: "labels"
  metadata_file: "metadata.csv"
  splits_file: "splits.yaml"

GCS directory layout:

{data.root}/
├── PLANET-RGB/
│   ├── 000001.tif
│   ├── 000002.tif
│   └── ...
├── EXTRA/
│   ├── 000001.tif          ← multi-band, channel order per §9
│   ├── 000002.tif
│   └── ...
├── labels/
│   ├── 000001.tif
│   └── ...
├── metadata.csv
├── splits.yaml             ← lists region names per split (see below)
└── splits_summary.json     ← generated by scripts/create_splits.py; reports tile counts

splits.yaml example format. It lists Arctic region names per split (not individual tile IDs); for more details see domain/inference_domain.md:

train:
  - elias range tundra
val:
  - arctic foothills tundra
test:
  - beringia lowland tundra

The DataLoader cross-references metadata.csv (which has RegionName per tile) to resolve tile IDs for each split. scripts/create_splits.py generates both splits.yaml and splits_summary.json (with per-split tile counts); both are committed to the repo for reproducibility.

metadata.csv:

Tile_id	centroid_lat	centroid_lon	TrainClass	RegionName	UIDs
000001	xx	xx	Negative	yakutia
000002	xx	xx	Positive	yakutia	xxx, xxx

Note: TrainClass values are Positive or Negative only. Hard negatives, if any exist (e.g. from Lingcao Huang's model false positives), are stored as Negative; no separate class is needed. UIDs are the RTS UIDs contained within the tile (used for tracking individual RTS); the field is empty for Negative tiles. RegionName is the Arctic subregion defined by ecology/permafrost extent (boundaries provided by Heidi Rodenhizer, see files in '/domain').

splits.yaml with several regions per split (blank entries are placeholders for additional region names):

train:
  - elias range tundra
  - 
  - 
val:
  - beringia lowland tundra
  - 
test:
  - arctic foothills tundra
  -
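The cross-reference between splits.yaml and metadata.csv described above can be sketched as follows. This is a standard-library illustration, not the project's DataLoader: the function name `resolve_splits`, the in-memory CSV, and the coordinate and UID values are all made up for the example, and raising an error for a region that is not listed in any split is an assumed policy.

```python
import csv
import io

# Illustrative stand-in for metadata.csv; coordinates and UIDs are made up.
METADATA_CSV = '''Tile_id,centroid_lat,centroid_lon,TrainClass,RegionName,UIDs
000001,61.0,-140.0,Positive,elias range tundra,"uid_a, uid_b"
000002,61.1,-140.2,Negative,elias range tundra,
000003,68.5,-150.0,Positive,arctic foothills tundra,uid_c
000004,65.0,-165.0,Negative,beringia lowland tundra,
'''

# Stand-in for the parsed splits.yaml (region names per split).
SPLITS = {
    "train": ["elias range tundra"],
    "val": ["arctic foothills tundra"],
    "test": ["beringia lowland tundra"],
}


def resolve_splits(rows, splits):
    """Map each split name to the list of tile IDs whose region belongs to it."""
    region_to_split = {}
    for split, regions in splits.items():
        for region in regions:
            if region in region_to_split:
                raise ValueError(f"region {region!r} appears in more than one split")
            region_to_split[region] = split

    tiles = {split: [] for split in splits}
    for row in rows:
        split = region_to_split.get(row["RegionName"])
        if split is None:  # assumed policy: every tile's region must be assigned
            raise ValueError(f"tile {row['Tile_id']} has unassigned region {row['RegionName']!r}")
        tiles[split].append(row["Tile_id"])
    return tiles


rows = list(csv.DictReader(io.StringIO(METADATA_CSV)))
tiles_by_split = resolve_splits(rows, SPLITS)
```

Checking for a region listed under two splits at load time gives a cheap second guard on the spatial-blocking rule, independent of the script that wrote the file.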

PLANET-RGB: derived from PlanetScope Basemap

Image: (512, 512, 3) — RGB
Label: (512, 512, 1) — uint8, values {0, 1, 255}

EXTRA: derived from other sources, resolution resampled to match the RGB

Image: (512, 512, N) — multi-band GeoTIFF, band count N and band order chosen per experiment
Label: (512, 512, 1) — uint8, values {0, 1, 255}

Example bands currently in use: NDVI, NIR (from Sentinel-2), RE (Relative Elevation), SR (Shaded Relief) from ArcticDEM. These are examples, not a fixed contract — any multi-band EXTRA raster works (e.g. slope, aspect from ArcticDEM, NDMI, NBR, SAR, GEE satellite embeddings).

Channel selection at training time: Specified in the YAML config (see configs/baseline.yaml §channels) as a list of {name, band} entries. name is an arbitrary label used in normalization_stats.json and logs; band is the 0-indexed position inside the EXTRA GeoTIFF. Example:

channels:
  extra:
    - {name: ndvi, band: 0}
    - {name: nir,  band: 1}
    - {name: re,   band: 2}
    - {name: sr,   band: 3}

Changing the stacked EXTRA set = edit the YAML. No code change. §9 below shows one concrete layout but is not authoritative.
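A minimal sketch of how a list of {name, band} entries could drive channel selection is shown below. It assumes the EXTRA raster has already been read into a NumPy array of shape (H, W, N), matching the shapes listed above; the function name `select_extra` and the toy array are hypothetical.

```python
import numpy as np

# Parsed form of a `channels.extra` list; here only two of four bands are kept.
channels_extra = [
    {"name": "ndvi", "band": 0},
    {"name": "re", "band": 2},
]


def select_extra(extra_hwn: np.ndarray, channels: list) -> tuple:
    """Return the selected bands (H, W, len(channels)) and their names, in config order."""
    n_bands = extra_hwn.shape[-1]
    indices = []
    for entry in channels:
        band = entry["band"]
        if not 0 <= band < n_bands:  # 0-indexed position inside the EXTRA raster
            raise IndexError(f"channel {entry['name']!r}: band {band} not in 0..{n_bands - 1}")
        indices.append(band)
    names = [entry["name"] for entry in channels]
    return extra_hwn[..., indices], names


# Toy 2x2 raster with 4 bands standing in for an EXTRA tile.
extra = np.arange(2 * 2 * 4, dtype=np.float32).reshape(2, 2, 4)
selected, names = select_extra(extra, channels_extra)
```

Returning the names alongside the array keeps the channel order used for training aligned with the order recorded in normalization_stats.json.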

Build order: Generate the PLANET-RGB tiles first for positive and negative samples, then derive EXTRA by extracting the auxiliary channels over each PLANET-RGB tile's extent (footprint).

3.4 EXTRA Channel Processing

All auxiliary data must be:

Reprojected to EPSG:3857
Resampled to match PlanetScope nominal resolution (~3 m) using bilinear interpolation for all channels
Co-registered with RGB using GeoTIFF bounding box information
Stacked as channels in an order you keep stable across the dataset (that same order is what you reference by band index in the YAML config). §9 shows one example layout.

4. Data Values

Both PLANET-RGB and EXTRA store raw values (no normalization applied to stored files).

4.1 Normalization

Before computing statistics, apply percentile clipping to remove outliers. This is a one-off step during statistics computation (not applied per-image at load time):

Run scripts/check_data.py first to visualise per-channel histograms and choose appropriate percentile bounds.
PlanetScope pre-processing may already have handled normalisation and outliers, so normalisation at this step mainly aligns value ranges with the pretrained backbone rather than improving image quality.
Pass the clipping percentiles as arguments to scripts/compute_normalization_stats.py.
Choose the clipping percentiles by inspecting the histogram: compute a histogram over all available positive and negative tiles, and save both the figure and the raw data for manual assessment.
The mean/std computed on the clipped data are saved in normalization_stats.json; the DataLoader uses only those stored values.
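The histogram-based workflow above can be sketched as follows for a single 8-bit channel. This is an illustration only and does not reproduce scripts/compute_normalization_stats.py: the function names are hypothetical, the percentile values are arbitrary, and integer uint8 data is assumed (float-valued EXTRA channels would need `np.histogram` with explicit bin edges instead of `np.bincount`).

```python
import numpy as np


def accumulate_hist(tiles):
    """Sum per-value pixel counts (0..255) over an iterable of uint8 arrays."""
    counts = np.zeros(256, dtype=np.int64)
    for tile in tiles:
        counts += np.bincount(tile.ravel(), minlength=256)
    return counts


def percentile_bounds(counts, lower_pct, upper_pct):
    """Smallest pixel values whose cumulative fraction reaches each percentile."""
    cdf = np.cumsum(counts) / counts.sum()
    lo = int(np.searchsorted(cdf, lower_pct / 100.0))
    hi = int(np.searchsorted(cdf, upper_pct / 100.0))
    return lo, hi


def clipped_mean_std(counts, lo, hi):
    """Mean and population std of the data after clipping to [lo, hi], from the histogram."""
    values = np.clip(np.arange(counts.size), lo, hi).astype(np.float64)
    n = counts.sum()
    mean = (counts * values).sum() / n
    var = (counts * (values - mean) ** 2).sum() / n
    return float(mean), float(np.sqrt(var))


# Toy channel: 98 pixels at value 100 plus one dark and one saturated outlier.
tile = np.array([0] + [100] * 98 + [255], dtype=np.uint8)
counts = accumulate_hist([tile])
lo, hi = percentile_bounds(counts, 2.0, 98.0)
mean, std = clipped_mean_std(counts, lo, hi)
```

Working from the accumulated histogram means the clipping bounds and the clipped statistics both come from the same saved raw data that is inspected manually.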

Normalisation should be calculated per dataset rather than per image, because:

Inference behavior stays consistent regardless of batch composition
Satellite imagery has consistent acquisition conditions within a sensor

Normalisation for EXTRA should be channel-specific, to respect the physical meaning of each signal.

Use per-dataset statistics computed once over the entire training set. This preserves absolute radiometric information critical for distinguishing RTS features.

4.2 Statistics Computation

Compute mean and standard deviation for each channel across all training tiles:

For RGB: compute over all training images (both positive and negative)
For EXTRA: compute separately for each channel respecting physical meaning

4.3 Storage Specification

Store normalization statistics in a JSON file that travels with the model:

models/
├── experiment_name/
│   ├── normalization_stats.json

normalization_stats.json structure:

Field	Description
dataset_version	Version string from `data/version.json` (e.g. "2.0"). This file is created as part of the data pipeline and committed to the repo.
computed_date	ISO timestamp of computation
n_tiles_used	Number of tiles used in computation
rgb.channel_names	Fixed: `["R", "G", "B"]`
rgb.mean	List of 3 values, order matches `rgb.channel_names`
rgb.std	List of 3 values, order matches `rgb.channel_names`
extra.channel_names	List of N names declared in the config (e.g. `["ndvi", "nir", "re", "sr"]`). Omit the whole `extra` block when training RGB-only.
extra.mean	List of N values, order matches `extra.channel_names`
extra.std	List of N values, order matches `extra.channel_names`

Note: the fields above hold the mean and std used for z-score standardisation; per-channel min and max can be stored as well if 0–1 normalisation is needed.
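To make the field table concrete, here is one way the file could look and be applied at load time. The nesting of `rgb` and `extra` as objects is inferred from the dotted field names, and every numeric value and the timestamp are placeholders rather than real statistics; `zscore` is a hypothetical helper.

```python
import json

import numpy as np

# Placeholder content following the field table; all numbers are illustrative only.
stats = {
    "dataset_version": "2.0",
    "computed_date": "2025-01-01T00:00:00Z",
    "n_tiles_used": 0,
    "rgb": {
        "channel_names": ["R", "G", "B"],
        "mean": [100.0, 110.0, 120.0],
        "std": [50.0, 50.0, 50.0],
    },
    "extra": {
        "channel_names": ["ndvi", "nir", "re", "sr"],
        "mean": [0.0, 0.0, 0.0, 0.0],
        "std": [1.0, 1.0, 1.0, 1.0],
    },
}

# Round-trip through JSON, as the file would be written once and read by the DataLoader.
loaded = json.loads(json.dumps(stats, indent=2))


def zscore(image_hwc: np.ndarray, mean, std) -> np.ndarray:
    """Per-channel z-score for an (H, W, C) array; mean/std broadcast over the last axis."""
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (image_hwc.astype(np.float32) - mean) / std


rgb = np.full((2, 2, 3), 150, dtype=np.uint8)
rgb_norm = zscore(rgb, loaded["rgb"]["mean"], loaded["rgb"]["std"])
```

Casting to float32 before subtracting matters for uint8 inputs, since subtracting in the integer dtype would wrap around for pixels below the mean.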

Compute: Loading terabytes of GeoTIFFs at once to calculate mean/std is impractical. Suggestion: use Welford’s online algorithm to compute the mean and variance in a single pass without holding all data in memory.
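A sketch of the single-pass idea is given below. It uses the batched form of Welford's update (the pairwise merge commonly attributed to Chan et al.), folding in one tile at a time so that only per-channel running totals stay in memory. The class name `RunningStats` and the random toy tiles are assumptions for illustration; clipping and ignore-pixel handling are left out.

```python
import numpy as np


class RunningStats:
    """Per-channel running mean and variance, updated one (H, W, C) tile at a time."""

    def __init__(self, n_channels: int):
        self.count = 0
        self.mean = np.zeros(n_channels, dtype=np.float64)
        self.m2 = np.zeros(n_channels, dtype=np.float64)  # sum of squared deviations

    def update(self, tile_hwc: np.ndarray) -> None:
        pixels = tile_hwc.reshape(-1, tile_hwc.shape[-1]).astype(np.float64)
        n_b = pixels.shape[0]
        if n_b == 0:
            return
        mean_b = pixels.mean(axis=0)
        m2_b = ((pixels - mean_b) ** 2).sum(axis=0)
        # Merge the tile's statistics into the running totals.
        delta = mean_b - self.mean
        total = self.count + n_b
        self.mean = self.mean + delta * n_b / total
        self.m2 = self.m2 + m2_b + delta ** 2 * self.count * n_b / total
        self.count = total

    def std(self) -> np.ndarray:
        return np.sqrt(self.m2 / self.count)  # population standard deviation


# Toy data: five small 3-channel tiles with different per-channel distributions.
rng = np.random.default_rng(0)
tiles = [rng.normal(loc=[10, 20, 30], scale=[1, 2, 3], size=(8, 8, 3)) for _ in range(5)]

stats = RunningStats(n_channels=3)
for tile in tiles:
    stats.update(tile)

# Reference values computed the memory-hungry way, for comparison only.
all_pixels = np.concatenate([t.reshape(-1, 3) for t in tiles])
```

Unlike accumulating a sum and a sum of squares, this update avoids subtracting two large nearly equal numbers, so it stays numerically stable over billions of pixels.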

5. Imbalance and Split

Estimation	Value
Within Positive tiles	5–70% of tile area
Real Arctic prevalence	0.1–0.5%

5.1 Split Ratios

Split	Positive	Negative	Purpose
Train	80% of N	80% * M / f1	Model training by curriculum learning
Val-Balanced	10% of N	10% * M / f2a	quick val during training
Val-Realistic	10% of N	10% * M / f2b	Early stopping, full val during training
Test-Realistic	10% of N	10% * M / f3	Final test score to report

Factor	Value
N	Number of positive tiles
M	Number of negative tiles
f1 (training)	Start at 1:1, warm up to 1:20
f2a (val-balanced)	1:20 (physical pool size; subsampled to 1:1 at evaluation time)
f2b (val-realistic)	1:200, 1:1000
f3 (test-realistic)	1:200, 1:1000
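To make the ratio factors concrete, the sketch below converts a positive:negative ratio into a negative-tile count and ramps the training ratio from 1:1 to 1:20. The specification only states the start and end ratios; the linear warm-up shape, the `warmup_epochs` parameter, and both function names are assumptions.

```python
def negatives_per_positive(epoch: int, warmup_epochs: int, start: int = 1, end: int = 20) -> int:
    """Negatives sampled per positive at a given epoch (assumed linear warm-up)."""
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return end
    return round(start + (end - start) * epoch / warmup_epochs)


def n_negatives(n_pos: int, ratio: int, n_neg_available: int) -> int:
    """Negative tiles needed for a 1:ratio mix, capped by what the pool holds."""
    return min(n_pos * ratio, n_neg_available)


# Training: ratio at the start, middle, and end of a 19-epoch warm-up.
schedule = [negatives_per_positive(e, warmup_epochs=19) for e in (0, 10, 19)]

# Realistic evaluation: 100 positive tiles at 1:1000 need 100,000 negative tiles.
needed_realistic = n_negatives(100, 1000, n_neg_available=10**9)
```

The second calculation shows why the realistic splits are demanding: the negative pool, not the positive count, sets the upper limit on how extreme a ratio can be evaluated.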

6. Spatial Blocking

6.1 Purpose

Prevent geographic data leakage between train/val/test splits. RTS in nearby tiles may share visual characteristics due to:

Similar geology and permafrost conditions
Similar vegetation patterns
Correlated image acquisition conditions

6.2 Blocking Strategy

Group tiles by Arctic subregion based on ecology/permafrost extent. Entire regions are assigned to train, val, or test — no region spans multiple splits.

Region definitions: Provided by Heidi Rodenhizer (boundaries complete). Stored as GeoJSON; see files in '/domain'.

Implementation: scripts/create_splits.py reads metadata.csv (which has RegionName per tile) and assigns each region to train/val/test according to the target ratios. The output splits.yaml is committed to the repo for reproducibility.

6.3 Tie-Break Rules for Region Assignment

Whole-region assignment and 80/10/10 tile-count targets cannot both be satisfied exactly when regions vary in size and RTS density. scripts/create_splits.py resolves conflicts using the following priority order. Constraints earlier in the list take precedence; when a hard constraint (1, 2, or 4) fails, the script exits with an error rather than silently producing a degenerate split.

1. Test set minimum positives: the test set must contain at least 100 positive tiles to enable statistically meaningful PR-AUC reporting at 1:1000 prevalence. If no whole-region assignment achieves this, fail loudly.
2. Validation set ecoregion diversity: the val set must span at least 2 distinct ecoregions, so that early-stopping decisions are not tied to a single regional artifact. If only one region can be assigned to val without breaking constraint 1, fail loudly.
3. Train set positive coverage: the train set should hold at least 70% of total positive tiles. Below this, emit a warning; do not fail.
4. Tile-count ratio drift tolerance: once constraints 1–3 are satisfied, accept up to ±10% drift from the 80/10/10 target (train: 70–90% of total tiles). Beyond this, fail loudly.

Geographic priority: assign the largest, most RTS-dense regions to train, the most diverse subset to val, and morphologically representative regions to test. The script logs its assignment reasoning.

Outputs:

splits.yaml: region assignments per split (committed)
splits_summary.json: per-split tile counts, positive counts, region list, observed vs. target drift, and the constraint-resolution log (committed)

If the available regions cannot satisfy constraints 1–2, the project needs more labelled regions before splitting — not a softer split rule.
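The priority order can be expressed as a small validation routine. The sketch below is not the logic of scripts/create_splits.py; it only checks a candidate assignment against the four constraints. `SplitSummary` and `check_constraints` are hypothetical names, and because the text gives an explicit tolerance band only for train (70–90%), only that band is checked.

```python
from dataclasses import dataclass, field


@dataclass
class SplitSummary:
    """Per-split totals for one candidate region assignment (hypothetical structure)."""
    n_tiles: int
    n_positive: int
    ecoregions: set = field(default_factory=set)


def check_constraints(train: SplitSummary, val: SplitSummary, test: SplitSummary) -> list:
    """Raise ValueError on a hard-constraint failure; return a list of warnings otherwise."""
    warnings = []
    total_tiles = train.n_tiles + val.n_tiles + test.n_tiles
    total_pos = train.n_positive + val.n_positive + test.n_positive

    # 1. Test set minimum positives (hard).
    if test.n_positive < 100:
        raise ValueError(f"test has {test.n_positive} positive tiles, need at least 100")
    # 2. Validation ecoregion diversity (hard).
    if len(val.ecoregions) < 2:
        raise ValueError("val must span at least 2 distinct ecoregions")
    # 3. Train positive coverage (warning only).
    if train.n_positive < 0.70 * total_pos:
        warnings.append("train holds less than 70% of positive tiles")
    # 4. Tile-count drift for train (hard).
    train_frac = train.n_tiles / total_tiles
    if not 0.70 <= train_frac <= 0.90:
        raise ValueError(f"train fraction {train_frac:.2f} outside 0.70-0.90")
    return warnings


ok = check_constraints(
    SplitSummary(8000, 800, {"a", "b", "c"}),
    SplitSummary(1000, 100, {"d", "e"}),
    SplitSummary(1000, 100, {"f"}),
)
```

Evaluating the constraints in list order means the first hard failure is the one reported, which matches the stated rule that earlier constraints take precedence.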

7. Negative Data Strategy

7.1 Sources

ARTS confirmed negatives: Known non-RTS locations
Hard negatives: False positive locations from Lingcao Huang's model (a possible source; not currently planned for implementation)

Both are stored with TrainClass = Negative in metadata.csv. No separate class distinction is needed in code — treated identically during sampling.

7.2 Augmentation

Negative samples can be inflated on-the-fly through augmentation to achieve desired imbalance ratios. See Training Guide for augmentation strategy.

8. Data Check

Run before training:

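All positive tiles contain RTS pixels (at least one label pixel equal to 1; a plain label sum is not reliable because ignore pixels are stored as 255)
All negative tiles contain no RTS pixels (no label pixel equal to 1)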
RGB values are in valid range (0-255 for uint8)
All tiles have matching image and label dimensions
No NaN or infinite values in EXTRA channels
GeoTIFF metadata (CRS, bounds) is consistent across tiles
metadata.csv has entries for all tiles
Spatial blocking is respected (no region spans train/val/test)
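Several of the per-tile checks above can be sketched as a single function operating on arrays that have already been read from disk. This is an illustration rather than the contents of scripts/check_data.py; `check_tile` is a hypothetical name, and the CRS, metadata.csv, and spatial-blocking checks are not covered because they need file and dataset-level context.

```python
import numpy as np

VALID_LABELS = (0, 1, 255)  # background, RTS, ignore


def check_tile(rgb, label, train_class, extra=None):
    """Return a list of problems found in one tile (an empty list means it passed)."""
    problems = []
    if rgb.dtype != np.uint8:
        problems.append("RGB is not uint8 (expected values in 0-255)")
    if rgb.shape[:2] != label.shape[:2]:
        problems.append("image and label dimensions differ")
    if not np.isin(label, VALID_LABELS).all():
        problems.append("label contains values other than 0, 1, 255")

    # Count RTS pixels explicitly; ignore pixels (255) must not count as RTS.
    has_rts = bool((label == 1).any())
    if train_class == "Positive" and not has_rts:
        problems.append("Positive tile has no RTS pixels")
    if train_class == "Negative" and has_rts:
        problems.append("Negative tile contains RTS pixels")

    if extra is not None:
        if extra.shape[:2] != rgb.shape[:2]:
            problems.append("EXTRA and RGB dimensions differ")
        if not np.isfinite(extra).all():
            problems.append("EXTRA contains NaN or infinite values")
    return problems


# Toy 4x4 tile: a Negative tile with one ignored pixel and a NaN in EXTRA.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
label = np.zeros((4, 4, 1), dtype=np.uint8)
label[0, 0, 0] = 255
extra = np.zeros((4, 4, 2), dtype=np.float32)
extra[1, 1, 0] = np.nan

problems = check_tile(rgb, label, "Negative", extra)
```

Returning a list instead of raising on the first failure lets a pre-training run report every issue across the dataset in one pass.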

9. Channel Index Reference

RGB band order is fixed. EXTRA is declared per-experiment in the YAML config — the table below is one example, not a contract. Keep whatever layout you write to disk consistent across the dataset and referenced correctly in configs/*.yaml §channels.extra.

RGB (fixed)

Index	Channel
0	Red
1	Green
2	Blue

EXTRA (example layout — adjust freely per experiment)

Index	Channel	Source
0	NDVI	Sentinel-2
1	NIR	Sentinel-2
2	Relative Elevation (RE)	ArcticDEM
3	Shaded Relief (SR)	ArcticDEM

Other examples: slope, aspect, NDMI, NBR, SAR backscatter, GEE satellite embeddings. Any combination works as long as the YAML points at the right band indices.

Label File

Value	Meaning
0	Background
1	RTS
255	Ignore
