Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 22 additions & 14 deletions .github/workflows/climate-update.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ jobs:
timeout-minutes: 360 # 6 hours max

env:
CDS_API_KEY: ${{ secrets.CDS_API_KEY }}
EDH_TOKEN: ${{ secrets.EDH_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-west-2
Expand All @@ -24,26 +24,34 @@ jobs:
use-public-rspm: true

- uses: r-lib/actions/setup-r-dependencies@v2
with:
extra-packages: any::ecmwfr

- name: Configure CDS API
run: |
echo "url: https://cds.climate.copernicus.eu/api" > ~/.cdsapirc
echo "key: $CDS_API_KEY" >> ~/.cdsapirc
chmod 600 ~/.cdsapirc
- name: Install uv (runs the Python backfill)
uses: astral-sh/setup-uv@v5

- name: Set ecmwfr key
- name: Verify required secrets present
run: |
Rscript -e 'ecmwfr::wf_set_key(key = Sys.getenv("CDS_API_KEY"))'

- name: Run update pipeline
missing=()
[ -z "$EDH_TOKEN" ] && missing+=("EDH_TOKEN")
[ -z "$AWS_ACCESS_KEY_ID" ] && missing+=("AWS_ACCESS_KEY_ID")
[ -z "$AWS_SECRET_ACCESS_KEY" ] && missing+=("AWS_SECRET_ACCESS_KEY")
if [ ${#missing[@]} -gt 0 ]; then
echo "ERROR: missing secrets: ${missing[*]}"
exit 1
fi

- name: Run EDH update pipeline
# pipefail so Rscript's exit code propagates through `tee`; default
# bash -e alone would let tee swallow a non-zero R exit.
shell: bash
run: |
set -o pipefail
mkdir -p logs
Rscript scripts/pipeline_update.R 2>&1 | tee logs/update_$(date +%Y%m%d).log
Rscript scripts/pipeline_update_edh.R 2>&1 | tee logs/update_$(date +%Y%m%d).log

# Only commit logs when running on main. workflow_dispatch from a
# feature branch should not rebase onto or push to main.
- name: Commit log
if: always()
if: always() && github.ref == 'refs/heads/main'
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
Expand Down
108 changes: 108 additions & 0 deletions planning/active/findings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
# Findings: EDH migration

## EDH API basics (from #35 research and tonight's benchmark)

**Zarr URL:** `https://data.earthdatahub.destine.eu/era5/reanalysis-era5-land-no-antartica-v0.zarr`

**Auth:** Personal access token, passed inline as URL password:
```
https://edh:<EDH_TOKEN>@data.earthdatahub.destine.eu/...
```
Alternative: `storage_options={"client_kwargs":{"trust_env":True}}` to read from `.netrc`.

**Quota:** 500,000 requests/month per user (confirmed in EDH getting-started docs).

**Coord conventions (from live Zarr inspection):**
- Dimensions: `valid_time`, `latitude`, `longitude`
- `latitude`: descending from ~90 to -60 (confirmed by working slice `slice(60, 48)`)
- `longitude`: **0 to 359.9** (0-360 convention, NOT -180 to 180)
- For BC bbox `c(60, -140, 48, -114)` → lon slice must be `220` to `246`
- `valid_time`: hourly, from 1950-01-01T00:00 to current month last day
- Variable names: CF-style short names (`t2m`, `tp`, `d2m`, `swvl1-4`, `u10`, `v10`, etc.)

**50 variables available in single Zarr store:**
`asn, d2m, e, es, evabs, evaow, evatc, evavt, fal, lai_hv, lai_lv, lblt, licd, lict, lmld, lmlt, lshf, ltlt, pev, ro, rsn, sd, sde, sf, skt, slhf, smlt, snowc, sp, src, sro, sshf, ssr, ssrd, ssro, stl1, stl2, stl3, stl4, str, strd, swvl1, swvl2, swvl3, swvl4, t2m, tp, tsn, u10, v10`

**Relevant variable mapping for cd package:**
| cd variable | EDH variable(s) | Units | Notes |
|---|---|---|---|
| tmean | `t2m` | K | Hourly; monthly mean computed client-side |
| tmax | `t2m` | K | Hourly; daily max → monthly mean of daily max |
| tmin | `t2m` | K | Hourly; daily min → monthly mean of daily min |
| prcp | `tp` | m | Hourly total precipitation; sum over month, × 1000 for mm |
| dewpoint | `d2m` | K | Hourly 2m dewpoint |
| soil_moisture | `swvl1`, `swvl2`, `swvl3`, `swvl4` | m³/m³ | 4 vertical layers |
| vpd | derived | hPa | From t2m + d2m (Tetens) |
| rh | derived | % | From t2m + d2m |

## Performance (single month BC bbox)

- Open Zarr store: 2.6-3.0s (metadata only)
- Subset + materialize 1 month × BC bbox × hourly t2m: 14.9-15.9s
- In-memory size: 92.9 MB for 744 hours × 120 lat × 260 lon float64

## Implications for pipeline

- No rate limits means we can pull **all months in parallel** via asyncio/dask if wanted
- Zarr chunking is independent of month boundaries, so small time-slice pulls are efficient
- Can pull all 7 cd variables from the **same dataset open** — no need for separate fetches per variable

## R/Zarr tooling options

- **Reticulate + Python xarray** — proven; what `test_edh_era5_land.py` uses. Simple, works.
- **stars::read_mdim()** — GDAL zarr driver. Requires GDAL built with zarr, auth via URL password likely works.
- **pizzarr** — young R-native zarr package, may not support remote HTTPS stores.

Decision deferred to Phase 3 after Phase 2 unblocks the data.

## Known limitation: UTC-day aggregation for tmax/tmin

Both the existing R pipeline (`pipeline_tmax_tmin_hourly.R`) and the new EDH
Python script compute daily max/min over UTC-day windows, not local-time days.
For BC (UTC-7/-8) the local-afternoon tmax peak straddles UTC day boundaries,
introducing a small systematic bias vs. a local-time daily aggregation.

- CDS's `derived-era5-land-daily-statistics` product accepts a `time_zone`
parameter and would fix this — but we abandoned it in #33 due to its
~2-hour-per-job queue time.
- Fixing properly requires shifting hourly timestamps by the local UTC offset
before `.resample("1D")`, or doing per-pixel local-time aggregation
(computationally expensive and overkill for a regional dataset).
- For now: the EDH script replicates the R pipeline's existing behavior so
outputs are directly comparable. Addressing this bias is a follow-up issue
(cd package methodology).

## Precipitation accumulation (ERA5-Land tp)

The hourly `tp` variable has `GRIB_stepType=accum` — it is a **running
accumulation** from 01:00 UTC reset to 00:00 UTC next day. Naive
`.sum()` over hourly values gives a monthly precip ~8× too high.

**Solution:** use EDH's daily product
`https://data.earthdatahub.destine.eu/era5/era5-land-daily-utc-v1.zarr`
which pre-computes daily totals correctly. Monthly precip = sum of
daily totals.

**Why not fix the hourly sum:** writing accumulation logic ourselves
is bug-prone. The daily product is what ECMWF designed for this
exact case.

**Caveat:** daily product only has 14 variables (t2m, d2m, swvl1-2, tp,
wind, radiation, etc.) — no swvl3 or swvl4. So soil moisture stays on
the hourly product. tmean and dewpoint could use either; we use hourly
for consistency and because the 0.99 correlation (vs CDS) was already
validated.

## EDH product catalogue (what's available for cd package)

| Product | Zarr URL | Variables | Use |
|---|---|---|---|
| Hourly ERA5-Land | `reanalysis-era5-land-no-antartica-v0.zarr` | 50 | tmax/tmin, tmean, dewpoint, soil_moisture |
| Daily ERA5-Land (UTC) | `era5-land-daily-utc-v1.zarr` | 14 | prcp only |
| Monthly ERA5 (not Land) | — | - | NOT useful (ERA5 parent, 31 km) |

## Out-of-scope confirmations

- GEE has ERA5-Land but commercial license blocks us
- AWS / ARCO / WeatherBench all serve ERA5 (31 km), not ERA5-Land
- CDS-beta is same backend, same rate limits
100 changes: 100 additions & 0 deletions planning/active/progress.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
# Progress: EDH migration

## Session 2026-04-12 (evening)

**Context from prior day:** Got rate-limited twice by CDS. Researched alternatives (#35), benchmarked EDH Zarr at 5× faster with no rate limits and same data. Filed #36 for migration. Pivoting now.

**Completed:**
- Branched off main: `36-edh-migration`
- Stashed and restored `scripts/test_edh_era5_land.py` on new branch
- Set up PWF files (this document, task_plan.md, findings.md)

**Next:**
- Update climate-update.yml GHA to use EDH (replace CDS fetcher with
scripts/backfill_edh_all.py + stage 3)
- Write pipeline_update_edh (EDH incremental update, replaces
CDS-based pipeline_update.R)
- Land the PR

## Session 2026-04-12 (Stage 3 to S3)

**Completed:**
- Wrote `scripts/pipeline_stage3_edh.R` — slim Stage-3-only against
`data/backfill/monthly/` (avoids CDS fetch in existing pipeline_backfill.R)
- Built 35 COGs locally: 7 variables × (annual + 4 seasons) × 76 years
- STAC catalog built, 35 items
- S3 sync pushed all 35 COGs to s3://stac-era5-land (178 MB total,
replacing the April 6-7 CDS-era COGs)
- Catalog.json unchanged (byte-identical) because filenames + extents
+ metadata didn't change — legitimate sync behavior, not a bug
- Verified end-to-end via /vsicurl: tmean_annual reads back with correct
76 years, CRS, extent. BC annual mean 1950 = -1.42 C, 2024 = +1.85 C
(3.3 C warming in 74 years — the climate departure signal the package
is built to show, now live on S3 from EDH data)

## Session 2026-04-12 (unified backfill run)

**Completed:**
- Full 1950-2025 unified backfill of 5 non-tmax/tmin variables via EDH in ~4 hours
(10:19 - 14:20 wall time)
- 532 monthly TIFs total (7 vars × 76 years), all on matching EPSG:4326 120×260 grid
- CDS-era files safely preserved in `data/backfill/monthly/_cds_backup/` (375 files)
- Two transient errors handled well:
- 1989: TimeoutError → retry-with-backoff caught it, completed on attempt 2
- 2008: ClientPayloadError → retry catch didn't include aiohttp class; outer handler
skipped the year and continued. Filled later via `--year 2008` rerun.
- QA confirmed: all grids aligned, tmin<=tmean<=tmax with zero violations,
physically plausible values, soil_moisture composite correct.

## Session 2026-04-12 (overnight run)

**Completed:**
- Full 1950-2025 backfill via EDH, ~1h 53min unattended (02:16 - 04:09)
- 75 year-files written (1951-2025) plus 1950 from earlier test = 76 years
- Avg ~90s per year, no failures
- Regenerated 2024 from EDH (was a CDS leftover from April 7 with different methodology)
- Filed #37 for the UTC-day aggregation bias (not blocking #36)
- All 76 years × 2 variables × 12 months × BC bbox GeoTIFFs in data/backfill/monthly/

**Validation (R/terra spot check):**
- 1950 Jul tmax up to 29.7 °C — historically realistic
- 2000 Jul tmax up to 28.9 °C
- 2024 Jul tmax up to 34.6 °C — matches heat dome era
- 2025 Jul tmax up to 31.9 °C
- All years: 12 layers, Jan..Dec names, EPSG:4326, BC bbox extent

## Session 2026-04-12 (morning QA)

**Completed:**
- Wrote `scripts/qa_monthly.R` — checks grid alignment, time coverage,
physical sanity across all variables
- Wrote `scripts/probe_edh_vars.py` — pulls one month of each EDH
variable, compares to CDS, validates aggregation semantics
- **Discovered grid mismatch:** CDS-era vars on 121×261 grid with no CRS
tag, vs EDH-era tmax/tmin on 120×260 EPSG:4326. Blocks release.
- **Validated EDH semantics:**
- tmean (hourly t2m mean): 0.99 correlation with CDS, 0.43 °C mean abs diff
- dewpoint: plausible BC winter values
- soil_moisture: all 4 depths match CDS composite closely
- prcp: hourly `tp` is accumulation (`GRIB_stepType=accum`) — need
EDH daily product for correct monthly precip
- **Decision:** two-Zarr approach, extend backfill to all variables so
the whole dataset is internally grid-aligned

**Next:**
- Commit QA + probe scripts on the branch (durable tooling)
- Extend backfill script to all 7 cd variables
- Regenerate `data/backfill/monthly/` entirely
- Re-run QA → R Stage 3 → PR

**Commits this session:**
- `0014bf2` — Add planning docs for EDH migration
- `7ef03cb` — Add EDH Zarr benchmark test script
- _pending_ — Add EDH-based tmax/tmin backfill script

**Verified 2026-04-12 02:11:**
- EDH Zarr benchmark: 15.9s / month BC t2m (5× faster than CDS)
- EDH full-year backfill: 114.3s / year BC tmax/tmin as GeoTIFF
- Output drop-in replacement for existing R Stage 2: 12 layers Jan..Dec,
EPSG:4326, °C, BC bbox — `terra::rast()` reads cleanly with expected names
- Realistic BC values for 1950: Jan tmax -30.5 to 0.7°C, Jul tmax -1.2 to 29.7°C
88 changes: 88 additions & 0 deletions planning/active/task_plan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
# Task Plan: Migrate cd_fetch() to DestinE Earth Data Hub Zarr

Issue: NewGraphEnvironment/cd#36
Branch: `36-edh-migration`

## Context

CDS (`ecmwfr`) rate limiting makes the tmax/tmin backfill take ~3 days of babysitting. Benchmark in #35 showed EDH Zarr delivers the same ERA5-Land data at ~15s/month (5× faster, no rate limits, 500K requests/month quota). Pivot the producer pipeline to EDH.

## Phase 1: Scaffold and benchmark (mostly done from #35)

- [x] Confirm EDH carries ERA5-Land at 9 km native (validated #35)
- [x] Confirm temporal coverage 1950 to present (validated #35)
- [x] Confirm commercial license (CC-BY 4.0, validated #35)
- [x] Write portable PEP 723 test script (`scripts/test_edh_era5_land.py`)
- [x] Bench one month BC t2m — 15.9s vs CDS 80s
- [x] Commit test script on the branch

## Phase 2: Pragmatic Python backfill (unblock #33)

The quickest path to finishing the tmax/tmin data: a Python script that uses EDH directly.
Runs outside R, produces monthly GRIB or NetCDF that downstream R stages already consume.

- [x] Write `scripts/backfill_edh_tmax_tmin.py` — tmax/tmin only for now (unblocks #33)
- [x] Output matches existing R pipeline's Stage 2 format (yearly `.tif` × 12 month bands, °C, EPSG:4326)
- [x] Band descriptions Jan..Dec so `cd_aggregate()` seasonal grouping works
- [x] Idempotent — skip years where both output files exist
- [x] Test on one year (1950) — 114s, realistic values, terra reads correctly
- [x] Run full backfill 1950-2025 — completed in ~1h 53min (76 years × tmax + tmin)
- [x] Regenerate 2024 from EDH (was a CDS leftover) for homogeneous methodology
- [x] QA — discovered CDS-era vars on a different grid (ext shifted 0.1°, CRS missing, 121×261 vs EDH 120×260)
- [x] Probe EDH products — confirmed two-Zarr approach: hourly for state vars, daily for prcp

## Phase 2b: Unified backfill (all variables on EDH grid)

Grid mismatch between CDS-era vars and EDH-era tmax/tmin blocks release.
Regenerating everything from EDH gives one internally-consistent dataset.

- [x] Extend backfill to all 7 cd variables (scripts/backfill_edh_all.py)
- [x] Regenerate all `data/backfill/monthly/*_YYYY.tif` with consistent grid + CRS
- [x] Re-run QA — all grids aligned, all CRS tagged, tmin<=tmean<=tmax sanity passes
- [x] VPD/RH derived in Python (Tetens), no R cd_derive re-run needed

**QA summary 2026-04-12:**
- 7 variables × 76 years = 532 monthly TIFs
- All on 120×260 EPSG:4326 grid, extent [-139.95, -113.95, 47.95, 59.95]
- Zero tmin>tmax or tmean inversion violations (163,888 cell-checks)
- (tmax+tmin)/2 vs tmean mean diff = 0.57°C (classical climatology shortcut bias, normal)
- 2008 had a ClientPayloadError that my retry didn't catch (wrong exception class);
re-ran --year 2008 to fill the gap. Filed as future improvement in #38.

## Phase 2c: R Stage 3

- [x] Run R stage 3 (COG + STAC + S3) against the unified EDH-generated TIFs
(scripts/pipeline_stage3_edh.R; 35 COGs live on s3://stac-era5-land,
verified via /vsicurl read)

## Phase 3: R integration for cd_fetch()

- [ ] Decide: reticulate + Python xarray, OR stars::read_mdim() via GDAL zarr driver
- [ ] Prototype both on one month, compare simplicity and performance
- [ ] Refactor `cd_fetch()` with `source = c("edh", "cds")` parameter, default `"edh"`
- [ ] Keep `cd_fetch()` CDS path for fallback
- [ ] Update tests to exercise both paths
- [ ] Update examples and docs

## Phase 4: Pipeline and docs

- [ ] Update `scripts/pipeline_backfill.R` to use EDH source
- [ ] Update monthly GitHub Action to use EDH
- [ ] Add `EDH_TOKEN` to repo secrets (rotate current token first)
- [ ] Update CLAUDE.md — CDS section → EDH primary, CDS fallback
- [ ] Update README / pkgdown auth setup instructions

## Phase 5: Close out

- [ ] Run full backfill from scratch via integrated R path — validate output matches phase 2
- [ ] Close #33 (tmax/tmin saga resolved by EDH)
- [ ] Close #35 (evaluation → migration done)
- [ ] Merge PR closing #36
- [ ] Archive planning to `planning/completed/`

## Success criteria

- `cd_fetch()` default path uses EDH; CDS path works as fallback
- Full 1950-2025 backfill completes in under 8 hours (single session, unattended)
- Vignette and tests pass with EDH as source
- CDS API knowledge preserved in CLAUDE.md but marked as fallback
Loading
Loading