Implement variance-based global sensitivity analysis (Sobol) pipeline by divine7022 · Pull Request #2 · ccmmf/uncertainty

divine7022 · 2025-11-26T03:20:48Z

Summary

Add a complete multisite variance-based global sensitivity analysis (GSA) pipeline based on Saltelli sampling and Sobol indices (sensobol). This PR generates Sobol designs that include parameters and drivers (IC, met) using the discrete-mapping approach recommended in the sensobol documentation, converts the designs to PEcAn inputs, launches model runs, computes Sobol indices, and adds a Quarto report to visualize and summarize the results.

This work implements the requirements of issue #152 (Multisite global (sobol) sensitivity) and lays the groundwork for Issue #153 (Variance decomposition / uncertainty partitioning).

scripts/021_generate_sobol_design.R --> orchestrates sobol design creation and saves design + metadata. Adds the dummy parameter for noise baselining.
R/global_sensitivity.R --> core function generate_sobol_design() (saltelli matrices via sensobol) with:
- inclusion of ic_ensemble and met_ensemble as sampled inputs,
- continuous --> prior quantile transforms for model parameters,
- continuous --> discrete mapping for IC/met using floor(qunif(..., min, max+1)) per sensobol examples.
scripts/022_prepare_pecan_inputs.R --> converts sobol design to pecan trait.samples / input_design and saves samples.Rdata + cache/input_design.rds.
scripts/023_run_global_sensitivity.R --> run wrapper that reads input_design and runs pecan workflow (note: script forces settings$ensemble$size <- 1 to ensure single-run per design row).
scripts/024_compute_sobol_indices.R --> loads ensemble outputs, reconstructs Y, computes sobol indices with sensobol::sobol_indices() (bootstrapped), and saves data/sobol_indices.csv.
analysis/global_sensitivity.qmd --> quarto analysis/report to visualize first-order and total-order indices, interactions, variance partitioning (Biology vs Environment), fixable parameters using dummy/noise, and additivity checks.
analysis/ (artifact files) --> report .qmd / .html and supporting files.
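
The continuous-to-discrete mapping for IC/met described above can be sketched in base R. This is a minimal illustration, not the code from R/global_sensitivity.R: the helper name `map_to_ensemble_id` and the clamping step for u = 1 are assumptions added for the example.

```r
# Map Sobol quantiles u in [0, 1] to integer ensemble IDs in min_id..max_id.
# `map_to_ensemble_id` is a hypothetical helper, not a function from this PR.
map_to_ensemble_id <- function(u, min_id, max_id) {
  # Stretch u over [min_id, max_id + 1) so every ID gets an equal-width bin
  ids <- floor(stats::qunif(u, min = min_id, max = max_id + 1))
  # u == 1 would land on max_id + 1, so clamp to the valid range (assumption)
  pmin(ids, max_id)
}

u <- c(0, 0.04, 0.06, 0.5, 0.999)
ic_id <- map_to_ensemble_id(u, min_id = 1, max_id = 20)
print(ic_id)
```

Using max + 1 as the upper bound gives each ensemble member the same probability mass; with max alone, the last ID would only be reachable at u = 1.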

dlebauer · 2026-02-03T21:40:14Z

R/global_sensitivity.R

+#' @param ic_range Integer vector of available IC ensemble IDs. Should match
+#'   the number of IC files in settings XML (e.g. 1:20, 1:100).
+#' @param met_range Integer vector of available met ensemble IDs. Should match
+#'   the number of met files in settings XML (e.g. 1:10).


Should ic_range and met_range be required to be sequential integers starting from 1? If so, would it be sufficient to have ic_size and met_size represent the size of each ensemble?

Copilot

Pull request overview

This PR implements a comprehensive multisite variance-based global sensitivity analysis (GSA) pipeline using Sobol indices for the SIPNET ecosystem model. The pipeline integrates Saltelli sampling for design generation, PEcAn workflow orchestration, management event generation with crop-specific fertilization rates, and variance decomposition analysis with visualization.

Changes:

- Added complete Sobol global sensitivity workflow (scripts 021-025) that generates quasi-random designs including parameters and environmental drivers (IC, met), converts to PEcAn format, executes model runs, and computes first/total-order Sobol indices
- Implemented management event generation system (script 023, R/management_events.R, R/crop_lookup.R) that maps quantile-based samples to crop-specific N fertilizer and compost rates using California agricultural databases
- Created Quarto analysis report (analysis/global_sensitivity.qmd) with variance decomposition visualizations, parameter ranking, interaction analysis, and factor screening using dummy parameter baseline

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 24 comments.


File	Description
scripts/sge_array_launcher.sh	SGE job array launcher for parallel model runs
scripts/run_pipeline.sh	Master orchestration script for local SA → global SA → variance decomposition
scripts/run_local_sa.sh	Local sensitivity (OAT) pipeline wrapper
scripts/run_global_sa.sh	Global sensitivity (Sobol) pipeline wrapper calling 021-025
scripts/021_generate_sobol_design.R	Generates Saltelli design matrices with params, IC, met using sensobol
scripts/022_prepare_pecan_inputs.R	Converts Sobol design to PEcAn samples.Rdata format
scripts/023_generate_management_events.R	Maps quantiles to crop-specific management events (N, compost)
scripts/024_run_global_sensitivity.R	PEcAn workflow execution with per-sample events registration
scripts/025_compute_sobol_indices.R	Computes bootstrapped Sobol indices from ensemble outputs
scripts/002_build_xml.R	Modified to set PFT posteriors and ensemble paths at config time
R/global_sensitivity.R	Core functions: generate_sobol_design(), compute_sobol_indices()
R/management_events.R	Event building functions for fertilization, with stubs for tillage/planting/harvest
R/crop_lookup.R	Joins LandIQ crop data with PEcAn lookup tables for site-specific N rates
analysis/global_sensitivity.qmd	Comprehensive Sobol analysis report with visualizations
000-config.yml	Added sobol config section and crop lookup paths
data_raw/template.xml	Added events input, sensitivity.analysis block; changed qsub from SLURM to SGE
README.md	Updated with pipeline structure and Phase 2 script documentation


Copilot · 2026-02-27T01:22:41Z

R/global_sensitivity.R

+    PEcAn.logger::logger.severe("'ic_size' and 'met_size' must be >= 1")
+  }
+
+  set.seed(42)


The seed is hard-coded to 42 in the function generate_sobol_design(). This makes all Sobol designs reproducible but prevents users from running multiple independent designs for validation or ensemble purposes. Consider adding a seed parameter with a default value of 42, or documenting this limitation prominently. For sensitivity analysis reproducibility is important, but users should be aware they cannot easily generate independent replicates.

good point! but for this pipeline reproducibility is the primary concern; we may add a seed parameter in a future iteration if independent replicates are needed
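
One way the suggested seed parameter could look, as a minimal sketch: `draw_design` is a hypothetical stand-in for generate_sobol_design() and only illustrates an overridable seed that keeps 42 as the reproducible default. It does not reflect how the PR builds its matrices.

```r
# Hypothetical stand-in for generate_sobol_design(); illustrates only the
# seed handling, not the actual Saltelli/sensobol design construction.
draw_design <- function(n, k, seed = 42L) {
  set.seed(seed)
  matrix(stats::runif(n * k), nrow = n, ncol = k)
}

a <- draw_design(4, 3)                  # default seed
b <- draw_design(4, 3)                  # same seed, identical design
c_alt <- draw_design(4, 3, seed = 7L)   # different seed, independent replicate
```

Callers that never pass `seed` keep the current behavior, so the default preserves reproducibility while leaving room for replicates later.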


Copilot · 2026-02-27T01:22:42Z

scripts/023_generate_management_events.R

+    event_config    = cfg$event_config %||% list()
+  )
+
+  # write JSON (validated by PEcAn.data.land::validate_events_json)


The comment mentions validation by PEcAn.data.land::validate_events_json, but the code doesn't actually call this validation function. If validation is important (which it likely is for events), either call the function or update the comment to accurately reflect that validation is not performed here.

Suggested change

-  # write JSON (validated by PEcAn.data.land::validate_events_json)
+  # write events JSON for this sample

Copilot · 2026-02-27T01:22:42Z

scripts/024_run_global_sensitivity.R

+input_design <- readRDS("cache/input_design.rds")
+
+status_file <- file.path(settings$outdir, "STATUS")
+if (!args$continue && file.exists(status_file)) file.remove(status_file)


The script removes the status file with file.remove(status_file) when --continue is not specified but doesn't check if the file exists first. While file.remove() returns FALSE silently if the file doesn't exist, this could mask issues. Consider checking file.exists(status_file) before removal, or wrap in a tryCatch to handle potential permission errors more gracefully.

file.exists() check is already present in if condition on the same line
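
The guard from the reviewed line can be exercised in isolation. This sketch uses a temporary directory and a plain logical `continue_run` in place of args$continue; both are assumptions made so the example is self-contained.

```r
status_file <- file.path(tempdir(), "STATUS")
continue_run <- FALSE  # stands in for args$continue

# First pass: the file exists, so the guard lets file.remove() run
file.create(status_file)
if (!continue_run && file.exists(status_file)) file.remove(status_file)
removed_first <- !file.exists(status_file)

# Second pass: the file is gone, the condition is FALSE,
# and file.remove() is never called
second <- if (!continue_run && file.exists(status_file)) file.remove(status_file)
```

Because file.exists() sits inside the `if` condition, the removal is attempted only when there is something to remove.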

Copilot · 2026-02-27T01:22:42Z

000-config.yml

+    pecan_outdir: "ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
+    master_design_points: "ccmmf_dir/data/design_points.csv"
  pecan_xml_template: "data_raw/template.xml"
  sites:
    design_points_file: "data_raw/sa_design_points.csv"
+    n_sample: 10
+  sensitivity:
+    sigma_levels: [-2, -1, 1, 2] # [-3, -2, -1, 1, 2, 3] # Standard
+  # Or use [-2, -1, 1, 2] for faster runs
+  # ensemble:
+  #   size: 20
+  #   n_met: 10
+  # run:
+  #   start_date: "2016-01-01"
+  #   end_date: "2023-12-31"
  projection:
-    ca_albers_crs: 3310
+    ca_albers_crs: 3310
+  # management uncertainty (023)
+  start_year: 2016
+  end_year: 2023
+  site_id: "1000025731"
+  # crop identity -> N rate mapping (021 + 023)
+  crop_lookup:
+    landiq_parquet: "ccmmf_dir/LandIQ-harmonized-v3/crops_all_years.parq"
+    pft_table_csv: "ccmmf_dir/cadwr-landuse/data/CARB_PFTs_table.csv"
+    crosswalk_csv: "ccmmf_dir/management/fertilization/crop_type_crosswalk.csv"


The paths under lines 10, 11, and 33-35 contain literal string "ccmmf_dir" instead of using YAML anchors or substitution. These will be interpreted as literal directory names like "ccmmf_dir/modelout/..." rather than substituting the value from line 6. YAML does not perform variable substitution by default. Consider using: (1) YAML anchors and aliases, (2) a templating system, or (3) code-level path resolution using file.path(cfg$paths$ccmmf_dir, "modelout/..."). The current configuration will fail at runtime.

good catch! ccmmf_dir/... paths on L10-11 are unused by the global SA pipeline (those are local SA keys);

should be fixed.
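
Option (3) from the review comment, code-level path resolution, might be sketched as follows. The helper `resolve_ccmmf_path`, the `cfg$paths$ccmmf_dir` key layout, and the example root "/data/ccmmf" are assumptions for illustration; the actual config structure is not shown in this thread.

```r
# Hypothetical helper: expand a leading "ccmmf_dir" placeholder at read time.
resolve_ccmmf_path <- function(path, ccmmf_dir) {
  # Only replace the placeholder when it is the first path component
  sub("^ccmmf_dir(?=/|$)", ccmmf_dir, path, perl = TRUE)
}

cfg <- list(
  paths = list(ccmmf_dir = "/data/ccmmf"),  # assumed key and value
  crop_lookup = list(
    landiq_parquet = "ccmmf_dir/LandIQ-harmonized-v3/crops_all_years.parq"
  )
)

landiq <- resolve_ccmmf_path(cfg$crop_lookup$landiq_parquet, cfg$paths$ccmmf_dir)
unchanged <- resolve_ccmmf_path("data_raw/template.xml", cfg$paths$ccmmf_dir)
```

Paths that do not start with the placeholder pass through untouched, so the helper can be applied to every path key without special cases.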

Copilot · 2026-02-27T01:22:46Z

scripts/021_generate_sobol_design.R

+  design_points_csv = cfg$sites$design_points_file %||%
+    "data_raw/design_points_198.csv",
+  landiq_parquet = cfg$crop_lookup$landiq_parquet %||%
+    "/projectnb2/dietzelab/ccmmf/LandIQ-harmonized-v3/crops_all_years.parq",
+  pft_table_csv = cfg$crop_lookup$pft_table_csv %||%
+    "/projectnb2/dietzelab/abv1/ccmmf/cadwr-landuse/data/CARB_PFTs_table.csv",
+  crosswalk_csv = cfg$crop_lookup$crosswalk_csv %||%
+    "/projectnb2/dietzelab/ccmmf/management/fertilization/crop_type_crosswalk.csv",
+  year   = as.integer(cfg$crop_lookup$landiq_year %||% 2023L), # (most recent)
+  season = as.integer(cfg$crop_lookup$landiq_season %||% 2L)
+  # NB crop identity is assumed constant across simulation years --
+  # this is a simplification for annual rotations.
+  # TODO use per-year LandIQ when rotation data is available
+)


The script uses hard-coded file paths with ccmmf_dir prefix in the config defaults (lines 216-220). These paths appear to be system-specific (e.g., "/projectnb2/dietzelab/ccmmf/..."). If the ccmmf_dir is not properly set in the config, the script will use these hard-coded paths which may not exist on other systems. Consider either removing the defaults or making them relative paths, and documenting the required config structure.

Copilot · 2026-02-27T01:22:47Z

R/global_sensitivity.R

+        PEcAn.logger::logger.severe(sprintf(
+          "variable %s for runid %s has %d values (expected %d)",
+          v, rid, length(Y), expected_len
+        ))


The function expects exactly N * (length(params) + 2) values in Y and will raise a severe error otherwise. However, if runs fail or are incomplete, this will cause the entire computation to abort. Consider adding an option to handle incomplete results, perhaps by padding with NA or excluding incomplete sites, especially since the comment in line 98 of 024_run_global_sensitivity.R mentions stop.on.error = FALSE. The current strict validation is inconsistent with allowing runs to fail.

Suggested change

-        PEcAn.logger::logger.severe(sprintf(
-          "variable %s for runid %s has %d values (expected %d)",
-          v, rid, length(Y), expected_len
-        ))
+        PEcAn.logger::logger.warn(sprintf(
+          "variable %s for runid %s has %d values (expected %d) -- skipping",
+          v, rid, length(Y), expected_len
+        ))
+        next

for sobol SA, all N*(k+2) runs are mathematically required by the Saltelli estimator. Skipping incomplete runs would produce incorrect indices without warning. The strict validation is intentional and consistent with the stop.on.error = FALSE in 024, which allows the MODEL step to continue but ensures we catch failures at the INDEX COMPUTATION step
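
The N*(k+2) requirement can be illustrated with a small base-R sketch. It assumes the output vector is ordered as the A block, the B block, then one AB block per input, each of length N; the helper `split_saltelli_outputs` is hypothetical and uses stop() instead of the PEcAn logger to stay self-contained.

```r
# Hypothetical helper; assumes rows are ordered A, B, AB_1, ..., AB_k.
split_saltelli_outputs <- function(Y, N, k) {
  expected_len <- N * (k + 2)
  if (length(Y) != expected_len) {
    stop(sprintf("Y has %d values (expected %d)", length(Y), expected_len))
  }
  # One block of N model outputs per design matrix
  blocks <- split(Y, rep(seq_len(k + 2), each = N))
  names(blocks) <- c("A", "B", paste0("AB_", seq_len(k)))
  blocks
}

N <- 4
k <- 3
Y <- seq_len(N * (k + 2))
blocks <- split_saltelli_outputs(Y, N, k)

# A single missing run shifts every later value into the wrong block,
# so the length check fails instead of returning misaligned blocks
incomplete <- tryCatch(
  split_saltelli_outputs(Y[-1], N, k),
  error = function(e) conditionMessage(e)
)
```

The estimator pairs row i of A with row i of each AB block, which is why a dropped run cannot simply be skipped.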

Copilot · 2026-02-27T01:22:47Z

data_raw/template.xml

  <delete.raw>FALSE</delete.raw>
-  <binary>sipnet.git</binary>
-  <prerun>cp data/events.in @RUNDIR@</prerun>
+  <binary>sipnet</binary>


The model binary was changed from "sipnet.git" to "sipnet". Ensure that the PEcAn installation has the "sipnet" binary in the expected location. If "sipnet.git" was a specific build or version identifier, this change might cause runtime failures if the binary is not found. Consider documenting why this change was made.

Suggested change

-  <binary>sipnet</binary>
+  <binary>sipnet.git</binary>

It's just a symlink; the binary can be renamed to anything.

Copilot · 2026-02-27T01:22:47Z

data_raw/template.xml

  <delete.raw>FALSE</delete.raw>
-  <binary>sipnet.git</binary>
-  <prerun>cp data/events.in @RUNDIR@</prerun>
+  <binary>sipnet</binary>


The prerun command cp data/events.in @RUNDIR@ was removed. This suggests events are now handled differently (via the new events input registration in setEnsemblePaths). However, if any legacy workflows or documentation reference this prerun command, they will need updating. Ensure this change is documented and that the new events mechanism is fully functional before removing the fallback.

Removal of <prerun> is intentional and is the core architectural change in this PR. Events are now registered via setEnsemblePaths(), and write.configs.SIPNET copies inputs$events$path into each run directory automatically. The old cp prerun was a workaround.

Copilot · 2026-02-27T01:22:47Z

scripts/022_prepare_pecan_inputs.R

+
+# config
+cfg <- yaml::read_yaml(args$config)
+settings <- PEcAn.settings::read.settings(cfg$settings_xml)


The script reads cfg$settings_xml directly on line 37 without checking if it exists or providing a fallback. If this key is missing from the config, the script will fail with an unclear error when trying to read NULL as a file path. Add a validation check similar to 021_generate_sobol_design.R, or provide a default value.

if NULL, read.settings(NULL) will throw a clear error; but noted..

…sing points from geo

divine7022 · 2026-03-05T09:26:37Z

global SA report of mvp runs -- https://drive.google.com/file/d/15NEWLViccNnGDx4ooy33Dq-NrCKdZ3j8/view?usp=sharing

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.


Copilot · 2026-03-05T18:38:16Z

R/global_sensitivity.R

+    PEcAn.logger::logger.severe("'ic_size' and 'met_size' must be >= 1")
+  }
+
+  set.seed(42)


The seed is hard-coded to 42 with no way to override it, making reproducibility invisible to callers and preventing legitimate sensitivity-to-seed studies. Consider adding a seed parameter (e.g., seed = 42L) to generate_sobol_design() so callers can control or document the seed explicitly.

Copilot · 2026-03-05T18:38:16Z

R/global_sensitivity.R

+                                  params,
+                                  N,
+                                  R = 500L) {
+


The expected output length formula N * (length(params) + 2) matches the Saltelli scheme used in generate_sobol_design(), which produces N * (k + 2) rows (A, B, and k AB matrices). However, params passed here already includes ic_ensemble and met_ensemble (appended in script 025), so length(params) equals k (all inputs including IC/met). This is consistent with design generation. That said, compute_sobol_indices accepts a generic params argument, and if a caller accidentally passes only the continuous params (without IC/met), the length check would silently use the wrong expected length. Consider validating or documenting that params must include IC/met entries to match the design.

Suggested change

+  # Validate that 'params' matches the Sobol design used to generate outputs.
+  # It must be the full parameter set, including 'ic_ensemble' and 'met_ensemble'.
+  if (!is.character(params) || length(params) == 0L) {
+    PEcAn.logger::logger.severe(
+      "'params' must be a non-empty character vector of parameter names (including 'ic_ensemble' and 'met_ensemble')."
+    )
+  }
+  required_params <- c("ic_ensemble", "met_ensemble")
+  if (!all(required_params %in% params)) {
+    PEcAn.logger::logger.severe(
+      paste0(
+        "'params' must include all parameter names used in the Sobol design, ",
+        "specifically: ", paste(required_params, collapse = ", "),
+        ". Received params: ", paste(params, collapse = ", ")
+      )
+    )
+  }

Copilot · 2026-03-05T18:38:16Z

tools/023_submit_events_array.sh

+
+set -euo pipefail
+
+PROJ_DIR="/projectnb/dietzelab/abv1/ccmmf/uncertainty"


PROJ_DIR is hard-coded to a personal project directory path. This will not work for other users or CI environments without manual editing. It should be derived dynamically (e.g., via $(pwd) or $(dirname "$(realpath "$0")")/..) or passed as a parameter, consistent with how CONFIG is handled in the other shell scripts.

Suggested change

-PROJ_DIR="/projectnb/dietzelab/abv1/ccmmf/uncertainty"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
+PROJ_DIR="${PROJ_DIR:-$(cd "${SCRIPT_DIR}/.." && pwd)}"

Copilot · 2026-03-05T18:38:17Z

scripts/022_prepare_pecan_inputs.R

+if (is.null(settings_xml) || !file.exists(settings_xml)) {
+  PEcAn.logger::logger.severe(
+    "Settings XML not found: ", settings_xml,
+    ".Check 'settings_xml' in ", args$config


Missing space before "Check" in the error message.

Suggested change

-    ".Check 'settings_xml' in ", args$config
+    ". Check 'settings_xml' in ", args$config

Copilot · 2026-03-05T18:38:17Z

scripts/023_generate_management_events.R

+baseline_events <- if (!is.null(baseline_raw$events)) baseline_raw$events else baseline_raw
+
+# --- load anchor site events (monitoring framework) ---
+# site keyed JSON with per type events for 17 anchor sites (defered scaling up).


"defered" should be "deferred".

Suggested change

-# site keyed JSON with per type events for 17 anchor sites (defered scaling up).
+# site keyed JSON with per type events for 17 anchor sites (deferred scaling up).

Copilot · 2026-03-05T18:38:17Z

R/global_sensitivity.R

+}
+
+.data <- rlang::.data


Assigning rlang::.data to a top-level variable .data in a source file is an anti-pattern. It pollutes the global namespace and shadows the .data pronoun used within dplyr verbs. The correct approach is to import it via #' @importFrom rlang .data in a package context, or simply reference it as rlang::.data inline. Since this is a sourced script (not a package), the cleanest fix is to remove this line and use .data directly in the dplyr calls (where it is already recognized) or qualify it as rlang::.data.

Suggested change

-}
-
-.data <- rlang::.data
+}

Copilot · 2026-03-05T18:38:17Z

scripts/024_run_global_sensitivity.R

+
+# --- config ---
+cfg <- yaml::read_yaml(args$config)
+settings_xml <- args$settings %||% cfg$settings_xml %||% "data_raw/settings_sa.xml"


cfg is read with yaml::read_yaml(args$config), which returns a nested list where the settings XML is stored at cfg$default$settings_xml (as seen in 021_generate_sobol_design.R and 022_prepare_pecan_inputs.R). Accessing cfg$settings_xml (top-level) will always return NULL, causing this script to always fall through to the hard-coded default path "data_raw/settings_sa.xml" rather than honoring the config file value.

Suggested change

-settings_xml <- args$settings %||% cfg$settings_xml %||% "data_raw/settings_sa.xml"
+settings_xml <- args$settings %||% cfg$default$settings_xml %||% cfg$settings_xml %||% "data_raw/settings_sa.xml"
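
The fall-through described in this comment can be reproduced with a plain list. This sketch assumes the parsed YAML nests its keys under `default`, as the comment states; the file name "data_raw/settings_global.xml" is a made-up value for illustration, and `%||%` is defined locally so the example does not depend on a particular R or rlang version.

```r
# Local definition so the sketch runs on R versions without a built-in `%||%`
`%||%` <- function(x, y) if (is.null(x)) y else x

# Assumed shape of the parsed YAML: keys nested under `default`
cfg <- list(default = list(settings_xml = "data_raw/settings_global.xml"))
cli_settings <- NULL  # stands in for args$settings when the flag is omitted

# Top-level lookup misses the nested key and silently uses the hard-coded default
top_level <- cli_settings %||% cfg$settings_xml %||% "data_raw/settings_sa.xml"

# Checking the nested key first honors the config file value
nested <- cli_settings %||% cfg$default$settings_xml %||%
  cfg$settings_xml %||% "data_raw/settings_sa.xml"
```

Since a missing list element is NULL rather than an error, the top-level lookup fails quietly, which is what makes this kind of bug easy to miss.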

Copilot · 2026-03-05T18:38:18Z

README.md

 │   └── template.xml
 ├── scripts/
 │   ├── 001_setup_design_points.R
+|   ├── 002_build_xml.R


The directory tree uses | (pipe) as the leading character on lines 68, 72, and 73, inconsistent with the │ (box-drawing character) used on all other tree lines. This causes the tree to render incorrectly.

Suggested change

-|   ├── 002_build_xml.R
+│   ├── 002_build_xml.R

000-config.yml

dlebauer

This is an excellent first pass at the analysis. This is the first half of my review; I have reviewed the analysis and R/*R code, and a few of the scripts. I will complete my review of the scripts/*R, tools, and README.md later.

In slack, @mdietze suggested using the existing PEcAn input type level global SA. That makes sense, but can be deferred and does not need to be addressed in this PR.

Requested changes are focused on correctness, clarity, and internal consistency.

Please defer anything that is not essential for MVP and correctness (including requested changes) by creating one or more follow up issues and/or a TODO file.

Changes requested:

- Rename 'biological parameters' --> 'model parameters' or simply 'parameters' throughout. SIPNET parameters are not all strictly 'biological'. Also update other uses of the biological parameters concept, including bio_frac and interpretation.
- Make sure that management uncertainty is included in analyses (results, interpretation, captions). It is present in analysis code but omitted/obscured in descriptions.
- Make it clear when results refer to 'first order (main-effect) variance'. Several places read as if Si or sum(Si) represents total variance.
- Disambiguate type-level vs per-input analysis; add clarification to README.
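
To illustrate why sum(Si) should not be read as total variance, here is a small worked example with exact arithmetic. It uses a toy model chosen for this sketch, Y = X1 * X2 with independent X1, X2 ~ Uniform(0, 1), which is unrelated to SIPNET; all variable names are illustrative.

```r
# Exact moments for independent X1, X2 ~ Uniform(0, 1) and Y = X1 * X2
mean_x  <- 1 / 2
var_x   <- 1 / 12
mean_x2 <- var_x + mean_x^2           # E[X^2] = 1/3

var_y  <- mean_x2^2 - mean_x^4        # Var(Y) = E[X1^2]E[X2^2] - (E[X1]E[X2])^2 = 7/144
v_main <- mean_x^2 * var_x            # Var(E[Y | X1]) = Var(X1 / 2) = 1/48

s_first       <- v_main / var_y       # first-order index of each input: 3/7
s_interaction <- 1 - 2 * s_first      # interaction index S_12: 1/7
s_total       <- s_first + s_interaction  # total-order index of each input: 4/7

sum_first <- 2 * s_first              # 6/7, less than 1
```

In this toy case the first-order indices account for 6/7 of the output variance; the remaining 1/7 is interaction variance that appears only in the total-order indices.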

Additional requested fixes

Use "input" where appropriate rather than overloading "parameter". In several places ranked objects include drivers and mgmt inputs not just model parameters

Future work:

- A separate type-level analysis using existing PEcAn Sobol functionality. This can be positioned upstream of this more detailed analysis.
- Coordinate with @ashiklom on functions in R/management_events.R.

dlebauer · 2026-03-16T20:35:03Z