YAML-driven workflow for a rapid coffee-suitability and forest-loss analysis over a user-defined area of interest.
The pipeline can:
- generate deterministic synthetic preprocessed inputs for an offline demo;
- download supported real datasets with Python where possible;
- preprocess raw data to a shared monthly time, lat, lon grid;
- apply simple coffee-suitability thresholds for NDVI, rainfall, and soil moisture;
- summarize overlap between suitable areas and forest-loss signals;
- write time-series plots, spatial maps, CSV tables, NetCDF outputs, and summary text.
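As a rough illustration of the threshold and overlap steps listed above, the sketch below marks a cell as suitable when all three indicators fall inside a range, then measures how many suitable cells also show forest loss. The threshold values and the names THRESHOLDS, is_suitable, and overlap_fraction are hypothetical; the real thresholds come from the YAML config and the real logic lives in run_analysis.py.

```python
# Hypothetical thresholds for illustration only; real values are set in the YAML config.
THRESHOLDS = {
    "ndvi": (0.4, 0.9),            # unitless
    "rainfall": (100.0, 300.0),    # mm per month
    "soil_moisture": (0.2, 0.45),  # m3/m3
}


def is_suitable(cell):
    """Return True when every indicator lies inside its (min, max) range."""
    return all(lo <= cell[key] <= hi for key, (lo, hi) in THRESHOLDS.items())


def overlap_fraction(cells):
    """Fraction of suitable cells that also carry a forest-loss signal."""
    suitable = [c for c in cells if is_suitable(c)]
    if not suitable:
        return 0.0
    lost = sum(1 for c in suitable if c["forest_loss"] > 0)
    return lost / len(suitable)


cells = [
    {"ndvi": 0.6, "rainfall": 150.0, "soil_moisture": 0.30, "forest_loss": 1},
    {"ndvi": 0.7, "rainfall": 200.0, "soil_moisture": 0.35, "forest_loss": 0},
    {"ndvi": 0.2, "rainfall": 150.0, "soil_moisture": 0.30, "forest_loss": 1},
]
print(overlap_fraction(cells))  # 0.5
```

The third cell is excluded by its low NDVI, so one of the two suitable cells overlaps forest loss.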
coffee_analysis/
├── create_synthetic_preprocessed_data.py # Offline demo data generator
├── download_data.py # Download supported raw datasets
├── preprocess_data.py # Harmonize raw data to NetCDF
├── run_analysis.py # Run suitability + deforestation analysis
├── validate_data.py # Validate/quarantine corrupt HDF5 files
├── data_pipeline.py # Shared pipeline utilities
├── pipeline_config.yml # Default synthetic demo config
├── pipeline_config.synthetic_preprocessed.yml
├── pipeline_config.real_few_days_l4_download.yml
├── pipeline_config.real_few_days_l4_analysis.yml
├── pipeline_config.real_12mo_download.yml
├── pipeline_config.real_12mo_analysis.yml
├── configs/legacy/ # Older exploratory/example configs
├── data/
│ ├── README.md
│ └── hansen_forest_loss_gee_export.js
└── results/
  └── README.md
Large downloaded data and most generated results are intentionally ignored by Git. The small synthetic preprocessed inputs and synthetic showcase results are kept so the framework can be demonstrated immediately.
Use Python 3.9+.
python3 -m pip install -r requirements.txt
On macOS, pyhdf is usually more reliable from conda-forge:
conda install -c conda-forge pyhdf
python3 -m pip install -r requirements.txt
The synthetic demo is the recommended first run because it does not require NASA Earthdata credentials or network downloads.
python3 create_synthetic_preprocessed_data.py --config pipeline_config.yml
python3 run_analysis.py --config pipeline_config.yml
This creates:
- synthetic preprocessed NetCDFs in data/synthetic_preprocessed/;
- plots, tables, masks, and summaries in results/synthetic_preprocessed/.
These synthetic showcase files are included in the shared project. You can regenerate them at any time with the two commands above.
The synthetic files match the expected preprocessed format: one NetCDF per variable with time, lat, lon, spatial_ref, and one data variable named ndvi, rainfall, soil_moisture, or forest_loss.
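The "deterministic" property of the demo data can be pictured with a seeded generator. The sketch below is an assumption about the general approach, not the code in create_synthetic_preprocessed_data.py: the function name, seed, and parameter values are all hypothetical.

```python
import math
import random


def synthetic_monthly_series(seed, months=12, base=0.6, amplitude=0.15, noise=0.02):
    """Seasonal sine cycle plus seeded noise; the same seed gives the same series."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    series = []
    for month in range(months):
        seasonal = amplitude * math.sin(2 * math.pi * month / 12)
        series.append(round(base + seasonal + rng.gauss(0.0, noise), 4))
    return series


first = synthetic_monthly_series(seed=42)
second = synthetic_monthly_series(seed=42)
print(first == second)  # True
```

Because the generator is seeded, regenerating the files yields identical values, which keeps the showcase results reproducible.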
For a tiny real-data smoke test using SMAP L4 over a few days:
python3 download_data.py --config pipeline_config.real_few_days_l4_download.yml
python3 validate_data.py --config pipeline_config.real_few_days_l4_download.yml --dataset soil_moisture
python3 preprocess_data.py --config pipeline_config.real_few_days_l4_download.yml --force
python3 run_analysis.py --config pipeline_config.real_few_days_l4_analysis.yml --force-preprocess
For a full 2023 run:
python3 download_data.py --config pipeline_config.real_12mo_download.yml
python3 validate_data.py --config pipeline_config.real_12mo_download.yml --dataset soil_moisture
python3 preprocess_data.py --config pipeline_config.real_12mo_download.yml --force
python3 run_analysis.py --config pipeline_config.real_12mo_analysis.yml --force-preprocess
The full-year run is much larger because SMAP L4 is sub-daily. Use the few-day run first to confirm credentials and file parsing.
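To get a feel for the size difference, the arithmetic below counts SMAP L4 granules for an inclusive date range. It assumes one SPL4SMGP granule every 3 hours; the three-day window is a hypothetical example, since the actual dates are set in the YAML configs.

```python
from datetime import date

GRANULES_PER_DAY = 24 // 3  # assumes one SPL4SMGP granule every 3 hours


def granule_count(start, end):
    """Number of granules for an inclusive date range."""
    days = (end - start).days + 1
    return days * GRANULES_PER_DAY


print(granule_count(date(2023, 1, 1), date(2023, 1, 3)))    # 24
print(granule_count(date(2023, 1, 1), date(2023, 12, 31)))  # 2920
```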
The real-data configs currently use:
- CHIRPS rainfall from a public monthly GeoTIFF URL pattern;
- MODIS monthly NDVI from MOD13A3.061 via earthaccess;
- SMAP L4 surface soil moisture from SPL4SMGP.008 via earthaccess;
- Hansen forest loss from a manual Google Earth Engine export.
Hansen forest loss is not downloaded automatically. Paste data/hansen_forest_loss_gee_export.js into the Google Earth Engine Code Editor, export the lossyear GeoTIFF, then place it at the raw_path used by the YAML config.
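For orientation, the lossyear band encodes 0 for no loss and a positive value N for loss detected in year 2000 + N. A minimal sketch of turning those pixel values into per-year counts follows; the function name and the sample pixels are hypothetical, and the pipeline's own aggregation may differ.

```python
from collections import Counter


def annual_loss_counts(lossyear_pixels):
    """Count loss pixels per calendar year; 0 means no loss and is skipped."""
    counts = Counter(2000 + v for v in lossyear_pixels if v > 0)
    return dict(sorted(counts.items()))


pixels = [0, 0, 23, 23, 19, 0, 21, 23]
print(annual_loss_counts(pixels))  # {2019: 1, 2021: 1, 2023: 3}
```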
MODIS and SMAP downloads use earthaccess. Credentials are not stored in YAML.
The downloader tries:
- a valid ~/.netrc;
- EARTHDATA_USERNAME and EARTHDATA_PASSWORD;
- an interactive terminal prompt.
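The lookup order above can be sketched with the standard library alone. This is not the downloader's code, which delegates to earthaccess; resolve_credentials and its parameters are hypothetical names used to show the fallback chain.

```python
import netrc
import os

EARTHDATA_HOST = "urs.earthdata.nasa.gov"  # Earthdata Login host


def resolve_credentials(netrc_path=None, env=None, prompt=None):
    """Return (source, username, password) following the order listed above."""
    env = os.environ if env is None else env
    netrc_path = netrc_path or os.path.expanduser("~/.netrc")

    # 1. A valid ~/.netrc entry for the Earthdata Login host.
    try:
        entry = netrc.netrc(netrc_path).authenticators(EARTHDATA_HOST)
    except (OSError, netrc.NetrcParseError):
        entry = None  # missing or malformed file: fall through
    if entry:
        login, _account, password = entry
        return "netrc", login, password

    # 2. Environment variables.
    user, pw = env.get("EARTHDATA_USERNAME"), env.get("EARTHDATA_PASSWORD")
    if user and pw:
        return "environment", user, pw

    # 3. Interactive prompt, injected as a callable so the sketch stays testable.
    if prompt is not None:
        user, pw = prompt()
        return "prompt", user, pw
    raise RuntimeError("No Earthdata credentials found")
```

Calling resolve_credentials() with no arguments checks the real ~/.netrc and process environment; passing a prompt callable such as one built on getpass.getpass adds the interactive fallback.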
Example environment-variable setup:
export EARTHDATA_USERNAME="your_username"
export EARTHDATA_PASSWORD="your_password"
Each dataset has an input mode:
- preprocessed: use the configured NetCDF file and skip download/preprocess;
- raw: read local raw files and run preprocessing;
- download: download supported raw files, then run preprocessing.
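One way to picture the three modes is as a mapping from mode to the steps that still need to run. The step names and the steps_for_mode helper below are hypothetical and only mirror the descriptions above.

```python
def steps_for_mode(mode):
    """Map a dataset input mode to the ordered pipeline steps it needs."""
    steps = {
        "preprocessed": ["analyze"],
        "raw": ["preprocess", "analyze"],
        "download": ["download", "preprocess", "analyze"],
    }
    if mode not in steps:
        raise ValueError(f"Unknown input mode: {mode!r}")
    return steps[mode]


for mode in ("preprocessed", "raw", "download"):
    print(mode, "->", steps_for_mode(mode))
```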
The main preprocessing behavior is:
- clip to the AOI and time window from the YAML;
- reproject raw inputs to analysis.grid.target_crs;
- choose the coarsest common grid when analysis.grid.strategy: "coarsest";
- aggregate to monthly time steps;
- save standardized NetCDF files for the analysis step.
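Two of the steps above, picking the coarsest grid and aggregating to months, can be sketched in plain Python. The native cell sizes are hypothetical, and the monthly mean is an assumption: the pipeline may use a different statistic per variable (rainfall, for example, is often summed).

```python
from collections import defaultdict
from datetime import date


def coarsest_resolution(resolutions_deg):
    """Pick the largest cell size so no dataset is resampled finer than its source."""
    return max(resolutions_deg.values())


def monthly_mean(samples):
    """Average (date, value) samples into one value per (year, month)."""
    buckets = defaultdict(list)
    for day, value in samples:
        buckets[(day.year, day.month)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


# Hypothetical native cell sizes in degrees, for illustration only.
native = {"ndvi": 0.01, "rainfall": 0.05, "soil_moisture": 0.09}
print(coarsest_resolution(native))  # 0.09

samples = [(date(2023, 1, 1), 0.5), (date(2023, 1, 2), 1.5), (date(2023, 2, 1), 0.25)]
print(monthly_mean(samples))  # {(2023, 1): 1.0, (2023, 2): 0.25}
```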
Each preprocessed NetCDF should contain:
- dimensions: time, lat, lon;
- CRS coordinate: spatial_ref;
- one data variable matching the dataset key, such as ndvi;
- monthly timestamps matching the analysis period.
Example:
Dimensions: time, lat, lon
Coordinates: time, lat, lon, spatial_ref
Variables: ndvi
Attributes: input_source_mode, dataset_key, grid_strategy, target_crs
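The requirements above can be expressed as a small checker. To stay dependency-free, this sketch works on a plain dict that stands in for metadata read from a NetCDF file; the dict layout and both function names are hypothetical.

```python
def expected_months(start, end):
    """List 'YYYY-MM' labels from start to end inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    labels = []
    while (year, month) <= (end_year, end_month):
        labels.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return labels


def check_structure(desc, dataset_key, start, end):
    """Return a list of problems; an empty list means the description passes."""
    problems = []
    if set(desc["dims"]) != {"time", "lat", "lon"}:
        problems.append("dimensions must be time, lat, lon")
    if "spatial_ref" not in desc["coords"]:
        problems.append("missing spatial_ref coordinate")
    if dataset_key not in desc["data_vars"]:
        problems.append(f"missing data variable {dataset_key}")
    if desc["months"] != expected_months(start, end):
        problems.append("timestamps do not match the analysis period")
    return problems


desc = {
    "dims": ["time", "lat", "lon"],
    "coords": ["time", "lat", "lon", "spatial_ref"],
    "data_vars": ["ndvi"],
    "months": expected_months("2023-01", "2023-03"),
}
print(check_structure(desc, "ndvi", "2023-01", "2023-03"))  # []
```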
Each analysis run writes a results folder containing:
- 01_timeseries_trends.png
- 02_spatial_maps.png
- 03_rainfall_soilmoisture_maps.png
- timeseries_indicators.csv
- annual_forest_loss.csv
- processed_dataset.nc
- coffee_suitability_mask.npy
- cumulative_forest_loss_map.npy
- overlap_mask.npy
- ndvi_trend_map.npy
- ANALYSIS_SUMMARY.txt
- RESULTS_SUMMARY.txt
- analysis_metadata.json
If SMAP preprocessing reports truncated .h5 files:
python3 validate_data.py --config pipeline_config.real_few_days_l4_download.yml --dataset soil_moisture --quarantine-corrupt
python3 download_data.py --config pipeline_config.real_few_days_l4_download.yml
Then rerun preprocessing.
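The quarantine idea can be sketched as follows. This is not how validate_data.py works internally, which this README does not describe; it is a cheap first-pass check that only tests for the 8-byte HDF5 signature at the start of the file, assuming no user block. A file truncated after its header would still pass, so full validation needs to open the file with an HDF5 library. The function names are hypothetical.

```python
import shutil
from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of a standard HDF5 file


def looks_like_hdf5(path):
    """Cheap first-pass check: the file starts with the HDF5 signature."""
    with open(path, "rb") as fh:
        return fh.read(8) == HDF5_SIGNATURE


def quarantine_bad_files(raw_dir, quarantine_dir):
    """Move .h5 files that fail the check into quarantine_dir; return moved names."""
    raw_dir, quarantine_dir = Path(raw_dir), Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(raw_dir.glob("*.h5")):
        if not looks_like_hdf5(path):
            shutil.move(str(path), str(quarantine_dir / path.name))
            moved.append(path.name)
    return moved
```

Moving rather than deleting keeps the suspect files available for inspection while letting the next download fetch fresh copies.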
If pyhdf installation fails with pip, install it with conda-forge:
conda install -c conda-forge pyhdf
If a dataset cannot be downloaded or parsed automatically, provide either:
- a preprocessed NetCDF and set mode: preprocessed; or
- local raw files and set mode: raw.
The maintained workflow is script-first and YAML-driven.