NewGraphEnvironment · Mar 22, 2026
diff --git a/scripts/README.md b/scripts/README.md
@@ -0,0 +1,158 @@
+# Pipeline Scripts
+
+This pipeline builds a searchable catalog of British Columbia's Digital Elevation Model (DEM) data. It takes ~58,000 GeoTIFF files hosted on the provincial objectstore, validates them, generates standardized metadata records, and registers them in a searchable catalog so anyone can find elevation data by location and time.
+
+## Key Concepts
+
+**DEM (Digital Elevation Model)** — A grid of elevation values representing the shape of the ground surface. Each pixel stores a height value (in metres). Used for slope analysis, flood modelling, watershed delineation, and terrain visualization.
+
+**GeoTIFF** — An image file format that embeds geographic coordinate information (projection, position, pixel size) directly in the file. This means GIS software knows exactly where on Earth the image belongs without needing a separate location file.
+
+**COG (Cloud Optimized GeoTIFF)** — A GeoTIFF organized internally so that a viewer can request just the piece it needs (e.g. a zoomed-in corner) over the internet, without downloading the whole file. The pipeline detects which source files are COGs and tags them accordingly in the catalog.
+
+**STAC (SpatioTemporal Asset Catalog)** — A standard way to describe geographic datasets with where-and-when metadata. Think of it as a library catalog for spatial data: each file gets a JSON record describing its location, date, and download link. This makes the collection searchable — "show me all DEMs that overlap this watershed" or "show me DEMs acquired after 2020."
+
+**S3** — Cloud file storage (Amazon-compatible). The generated catalog JSON files are uploaded here so they are accessible via URL from anywhere.
+
+**pgstac** — A PostgreSQL schema and set of functions for storing and searching STAC records. Deployed behind the search API at `images.a11s.one`, this is what allows users to search the collection by location from QGIS, a web browser, or any STAC-compatible tool.
+
+**Date extraction** — The source GeoTIFFs don't carry acquisition dates in their internal metadata, so the pipeline infers dates from the filename. It looks for a pattern like `_utm10_20230415.tif` (after `_utmXX_`, grab the 4–8 digit date) to get a full date (`YYYYMMDD`) or just a year (`YYYY`). If neither pattern is found, it falls back to looking for a `/YYYY/` directory in the URL path. Files with no detectable date get a placeholder (`2000-01-01`) and are flagged with `datetime_unknown=True` so they can be filtered or fixed later.
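+
+The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code in `stac_utils.py`: the function name `infer_date`, both regular expressions, the choice to accept only 4- or 8-digit runs, and mapping a bare year to January 1 are all assumptions made for the example.
+
+```python
+import re
+from datetime import date
+
+PLACEHOLDER = date(2000, 1, 1)
+
+# Hypothetical patterns inferred from the prose; the pipeline's real regexes may differ.
+FILENAME_DATE = re.compile(r"_utm\d{1,2}_(\d{8}|\d{4})(?!\d)")
+PATH_YEAR = re.compile(r"/((?:19|20)\d{2})/")
+
+
+def infer_date(url):
+    """Return (date, datetime_unknown) inferred from a GeoTIFF URL."""
+    match = FILENAME_DATE.search(url)
+    if match:
+        digits = match.group(1)
+        try:
+            if len(digits) == 8:  # YYYYMMDD
+                return date(int(digits[:4]), int(digits[4:6]), int(digits[6:])), False
+            return date(int(digits), 1, 1), False  # YYYY only (assumed to mean Jan 1)
+        except ValueError:
+            pass  # impossible date such as month 13; try the directory fallback
+    match = PATH_YEAR.search(url)
+    if match:  # /YYYY/ directory in the URL path
+        return date(int(match.group(1)), 1, 1), False
+    return PLACEHOLDER, True  # no date found: placeholder plus flag
+```
+
+Returning the flag alongside the date keeps the placeholder from being mistaken for a real acquisition date further down the pipeline.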
+
+**Validation caching** — The pipeline reads each remote GeoTIFF once to extract metadata (projection, dimensions, bounds, COG status) and saves the results to a local CSV. On subsequent runs, items are built from the cache instead of re-reading remote files. This is what makes incremental updates fast (minutes instead of hours).
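+
+A read-through cache of this kind might look like the sketch below. The column names in `FIELDS`, the helper names, and the `read_remote` callback (standing in for the slow remote GeoTIFF read) are hypothetical; the actual layout of `data/stac_geotiff_checks.csv` may differ.
+
+```python
+import csv
+import os
+
+# Illustrative columns only; not the real CSV schema.
+FIELDS = ["url", "crs", "width", "height", "is_cog"]
+
+
+def load_cache(path):
+    """Load cached rows keyed by URL; an absent file means an empty cache."""
+    if not os.path.exists(path):
+        return {}
+    with open(path, newline="") as f:
+        return {row["url"]: row for row in csv.DictReader(f)}
+
+
+def metadata_for(urls, cache_path, read_remote):
+    """Return one metadata row per URL, calling read_remote only on cache misses."""
+    cache = load_cache(cache_path)
+    new_rows = []
+    for url in urls:
+        if url not in cache:
+            row = {"url": url, **read_remote(url)}  # slow path: remote read
+            cache[url] = row
+            new_rows.append(row)
+    if new_rows:
+        needs_header = not os.path.exists(cache_path) or os.path.getsize(cache_path) == 0
+        with open(cache_path, "a", newline="") as f:
+            writer = csv.DictWriter(f, fieldnames=FIELDS)
+            if needs_header:
+                writer.writeheader()
+            writer.writerows(new_rows)  # append only the newly read rows
+    return [cache[url] for url in urls]
+```
+
+Appending only the new rows means a second run over the same URL list performs no remote reads at all, which is the behaviour the incremental timings depend on.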
+
+## Quick Start
+
+```bash
+# Full safe build (backup, fetch, validate, create, check)
+./scripts/build_safe.sh
+
+# Or run individual steps from the project root
+Rscript scripts/urls_fetch.R
+python scripts/urls_check_access.py
+python scripts/collection_create.py
+python scripts/item_create.py
+python scripts/item_validate.py
+Rscript scripts/s3_sync.R
+```
+
+## Pipeline Steps
+
+| Step | Script | What it does |
+|------|--------|--------------|
+| 0 | `detect_changes.R` | Compare the cached URL list against a fresh objectstore listing to find new or deleted files — this drives incremental updates |
+| 1 | `urls_fetch.R` | Fetch the master list of DEM GeoTIFF URLs from the BC objectstore (~58,000 files), filtering out filenames with parentheses that fail validation |
+| 2 | `urls_check_access.py` | Verify source URLs are actually reachable (parallel HTTP HEAD checks), flagging 403s or other access problems |
+| 3 | `collection_create.py` | Create the top-level STAC collection record (`collection.json`) with spatial and temporal extent metadata |
+| 4 | `item_create.py` | The main workhorse — read each GeoTIFF's metadata remotely, cache it, and generate a STAC JSON record for each file (32 parallel workers) |
+| 5 | `item_validate.py` | Check every generated STAC JSON against the spec using pystac, producing a pass/fail report |
+| 6 | `s3_sync.R` | Sync the local catalog to the S3 bucket, uploading only new or changed files |
+| — | `build_safe.sh` | Orchestrates steps 1–5 with automatic backups, timestamped build directories, and optional auto-promotion to production |
+| — | `catalogue_qa.py` | Spot-check QA — randomly samples items and compares local vs S3 versions to catch sync issues |
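+
+As an illustration of step 2, parallel HEAD checks can be sketched with the standard library alone. This is not the code in `urls_check_access.py`: the function names, the worker count, and the rule that only a 200 response counts as accessible are assumptions for the example.
+
+```python
+import urllib.error
+import urllib.request
+from concurrent.futures import ThreadPoolExecutor
+
+
+def head_status(url, timeout=10):
+    """Return the HTTP status of a HEAD request, or None if the host is unreachable."""
+    request = urllib.request.Request(url, method="HEAD")
+    try:
+        with urllib.request.urlopen(request, timeout=timeout) as response:
+            return response.status
+    except urllib.error.HTTPError as err:
+        return err.code  # e.g. 403 or 404
+    except OSError:  # URLError and timeouts are OSError subclasses
+        return None
+
+
+def check_access(urls, check=head_status, workers=16):
+    """Check all URLs concurrently and return one result dict per URL, in input order."""
+    with ThreadPoolExecutor(max_workers=workers) as pool:
+        statuses = list(pool.map(check, urls))
+    return [{"url": u, "status": s, "ok": s == 200} for u, s in zip(urls, statuses)]
+```
+
+Passing the checker in as the `check` argument lets the flagging logic be exercised offline with a stub before it is pointed at the objectstore.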
+
+### Fix-up Scripts
+
+When validation finds problems, these scripts help:
+
+| Script | What it does |
+|--------|--------------|
+| `item_extract_invalid.py` | Pull failed item IDs from the validation report and convert them back to source URLs |
+| `item_reprocess.py` | Re-create invalid items with improved handling (e.g. placeholder dates for files missing date information) |
+
+### Supporting Scripts
+
+| Script | What it does |
+|--------|--------------|
+| `stac_utils.py` | Shared Python utilities — metadata extraction, date parsing, URL encoding, constants (paths, BC bounding box) |
+| `functions.R` | R utilities for VM deployment and table formatting |
+| `staticimports.R` | Auto-generated R helper functions |
+| `utils.R` | Minimal R utilities |
+| `benchmark_fetch.R` | Timing benchmarks for URL fetching approaches |
+| `footprint_visualize.R` | Visualize DEM tile footprints on a map |
+| `stac_examples.qmd` | Example STAC API queries for exploring the finished catalog |
+
+## Data Flow
+
+```
+BC Objectstore (nrs.objectstore.gov.bc.ca/gdwuts)
+  ↓ urls_fetch.R — list all GeoTIFF URLs
+data/urls_list.txt
+  ↓ urls_check_access.py — verify URLs are reachable
+data/urls_access_checks.csv
+  ↓ item_create.py — read metadata, cache it, generate STAC records
+data/stac_geotiff_checks.csv          (cached metadata)
+stac/prod/stac_dem_bc/*.json           (one record per DEM tile)
+stac/prod/stac_dem_bc/collection.json  (collection summary)
+  ↓ item_validate.py — check all records against STAC spec
+data/stac_item_validation.csv
+  ↓ s3_sync.R — push to cloud
+s3://stac-dem-bc/
+  ↓ pgstac registration
+images.a11s.one (searchable API)
+```
+
+## Re-running is Safe
+
+Each step below checks for existing outputs and skips work already done, so you can re-run after adding new files or fixing a problem without reprocessing everything:
+
+| Step | What gets skipped |
+|------|-------------------|
+| `urls_fetch.R` | Reuses cached `urls_list.txt` in test mode |
+| `urls_check_access.py` | URLs already checked (cached in CSV) |
+| `item_create.py` | GeoTIFFs with cached metadata skip the slow remote read; existing items skip creation |
+| `item_validate.py` | In `--incremental` mode, only validates items added since the last run |
+| `s3_sync.R` | Only uploads new or changed files |
+
+## Run Modes
+
+Most scripts support flags that control scope:
+
+```bash
+# Test mode — process a small sample for development
+python scripts/item_create.py --test --test-count 50
+
+# Incremental — only process new files detected by change detection
+python scripts/item_create.py --incremental
+
+# Reprocess — fix previously invalid items
+python scripts/item_create.py --reprocess-invalid
+
+# Full production — process everything
+python scripts/item_create.py
+```
+
+## Logs
+
+Each pipeline run generates timestamped log files in `logs/`. The naming convention is `YYYYMMDD_HHMMSS_description.log`.
+
+Logs capture configuration, progress, errors, warnings, and timing — making it possible to debug failures after the fact and track performance over time. When a weekly cron job runs unattended, logs are the only record of what happened.
+
+The `build_safe.sh` orchestrator creates a separate log file for each step, so if step 4 fails you can inspect that log without wading through the output of steps 1–3.
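+
+For reference, a filename that follows the `YYYYMMDD_HHMMSS_description.log` convention can be produced as shown here. The helper name `log_path` and its parameters are hypothetical; the pipeline scripts may build their log paths differently.
+
+```python
+from datetime import datetime
+from pathlib import Path
+
+
+def log_path(description, log_dir="logs", now=None):
+    """Build logs/YYYYMMDD_HHMMSS_description.log for the given (or current) time."""
+    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
+    return Path(log_dir) / f"{stamp}_{description}.log"
+```
+
+Because the timestamp leads the name, a plain alphabetical listing of `logs/` is also a chronological one.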
+
+## Performance
+
+| Scenario | Time | Notes |
+|----------|------|-------|
+| Full build (58,000 items) | ~5–6 hours | Network I/O bound — reading remote GeoTIFFs for metadata |
+| Incremental update (50 new files) | 5–15 minutes | Reads only new files, builds from cache for the rest |
+| Validation only | ~10 minutes | Local JSON file reads, no network |
+
+The bottleneck is network: each GeoTIFF must be partially read over HTTP to extract its projection, dimensions, and bounds. Once cached, subsequent builds are fast.
+
+## Prerequisites
+
+| Component | What's needed |
+|-----------|---------------|
+| Python | `pystac`, `rio-stac`, `rasterio`, `rio-cogeo`, `pandas`, `tqdm` |
+| R | `ngr` package (for objectstore listing) |
+| AWS CLI | Configured with write access to `s3://stac-dem-bc` |
+| System | `rio` CLI tools (installed with rasterio) |
+
+## After the Pipeline
+
+Once the catalog is on S3, register it in pgstac to make it searchable:
+
+```bash
+ssh root@<VM_IP> "bash /tmp/stac_register-pypgstac.sh stac-dem-bc https://stac-dem-bc.s3.amazonaws.com"
+```
+
+This loads the STAC records into PostgreSQL, powering the search API at `images.a11s.one`. Once registered, the collection is browsable in QGIS (STAC Data Source Manager), directly through the API, or with any STAC-compatible client.
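+
+To give a sense of what a location-and-time query looks like, the sketch below builds the JSON body for a STAC API Item Search request (`POST /search`). It only constructs the payload and sends nothing. The collection id `stac-dem-bc` is assumed from the registration command above, and the helper name `search_body` is hypothetical.
+
+```python
+import json
+
+
+def search_body(bbox, start=None, end=None, collection="stac-dem-bc", limit=10):
+    """Build an Item Search body; bbox is (west, south, east, north) in lon/lat."""
+    body = {"collections": [collection], "bbox": list(bbox), "limit": limit}
+    if start or end:
+        # STAC API intervals use ".." for an open-ended side.
+        body["datetime"] = f"{start or '..'}/{end or '..'}"
+    return body
+
+
+if __name__ == "__main__":
+    # DEMs overlapping a box near Vancouver, acquired from 2020 onward.
+    query = search_body((-123.5, 49.0, -122.5, 49.5), start="2020-01-01T00:00:00Z")
+    print(json.dumps(query, indent=2))
+```
+
+The printed JSON could then be posted to the API's search endpoint with any HTTP client, or the same parameters entered in a STAC-aware tool.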