Skip to content

Implement variance-based global sensitivity Analysis (Sobol) pipeline#2

Open
divine7022 wants to merge 41 commits intoccmmf:mainfrom
divine7022:global-SA
Open

Implement variance-based global sensitivity Analysis (Sobol) pipeline#2
divine7022 wants to merge 41 commits intoccmmf:mainfrom
divine7022:global-SA

Conversation

@divine7022
Copy link
Collaborator

Summary

Add a complete multisite variance-based global sensitivity analysis (GSA) pipeline based on Saltelli sampling and Sobol indices (sensobol). This PR implements generation of sobol designs that include parameters and drivers (IC, met) using the discrete-mapping approach recommended in the sensobol documentation, converts designs to pecan inputs, launches model runs, computes sobol indices, and adds a quarto report to visualize and summarize results.

This work implements the requirements of issue #152 (Multisite global (sobol) sensitivity) and lays the groundwork for Issue #153 (Variance decomposition / uncertainty partitioning).


  • scripts/021_generate_sobol_design.R --> orchestrates sobol design creation and saves design + metadata. Adds the dummy parameter for noise baselining.
  • R/global_sensitivity.R --> core function generate_sobol_design() (saltelli matrices via sensobol) with:
    • inclusion of ic_ensemble and met_ensemble as sampled inputs,
    • continuous -- prior quantile transforms for model parameters,
    • continuous -- discrete mapping for IC/met using floor(qunif(..., min, max+1)) per sensobol examples.
  • scripts/022_prepare_pecan_inputs.R --> converts sobol design to pecan trait.samples / input_design and saves samples.Rdata + cache/input_design.rds.
  • scripts/023_run_global_sensitivity.R --> run wrapper that reads input_design and runs pecan workflow (note: script forces settings$ensemble$size <- 1 to ensure single-run per design row).
  • scripts/024_compute_sobol_indices.R --> loads ensemble outputs, reconstructs Y, computes sobol indices with sensobol::sobol_indices() (bootstrapped), and saves data/sobol_indices.csv.
  • analysis/global_sensitivity.qmd --> quarto analysis/report to visualize first-order and total-order indices, interactions, variance partitioning (Biology vs Environment), fixable parameters using dummy/noise, and additivity checks.
  • analysis/ (artifact files) --> report .qmd / .html and supporting files.

Comment on lines +21 to +24
#' @param ic_range Integer vector of available IC ensemble IDs. Should match
#' the number of IC files in settings XML (e.g. 1:20, 1:100).
#' @param met_range Integer vector of available met ensemble IDs. Should match
#' the number of met files in settings XML (e.g. 1:10).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should ic_range and met_range be required to be a sequential integer starting from 1? If so, would it be sufficient to have ic_size and met_size representing the size of the ensemble?

@divine7022 divine7022 requested a review from dlebauer February 27, 2026 00:56
@dlebauer dlebauer requested review from Copilot and removed request for dlebauer February 27, 2026 01:14
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a comprehensive multisite variance-based global sensitivity analysis (GSA) pipeline using Sobol indices for the SIPNET ecosystem model. The pipeline integrates Saltelli sampling for design generation, PEcAn workflow orchestration, management event generation with crop-specific fertilization rates, and variance decomposition analysis with visualization.

Changes:

  • Added complete Sobol global sensitivity workflow (scripts 021-025) that generates quasi-random designs including parameters and environmental drivers (IC, met), converts to PEcAn format, executes model runs, and computes first/total-order Sobol indices
  • Implemented management event generation system (script 023, R/management_events.R, R/crop_lookup.R) that maps quantile-based samples to crop-specific N fertilizer and compost rates using California agricultural databases
  • Created Quarto analysis report (analysis/global_sensitivity.qmd) with variance decomposition visualizations, parameter ranking, interaction analysis, and factor screening using dummy parameter baseline

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 24 comments.

Show a summary per file
File Description
scripts/sge_array_launcher.sh SGE job array launcher for parallel model runs
scripts/run_pipeline.sh Master orchestration script for local SA → global SA → variance decomposition
scripts/run_local_sa.sh Local sensitivity (OAT) pipeline wrapper
scripts/run_global_sa.sh Global sensitivity (Sobol) pipeline wrapper calling 021-025
scripts/021_generate_sobol_design.R Generates Saltelli design matrices with params, IC, met using sensobol
scripts/022_prepare_pecan_inputs.R Converts Sobol design to PEcAn samples.Rdata format
scripts/023_generate_management_events.R Maps quantiles to crop-specific management events (N, compost)
scripts/024_run_global_sensitivity.R PEcAn workflow execution with per-sample events registration
scripts/025_compute_sobol_indices.R Computes bootstrapped Sobol indices from ensemble outputs
scripts/002_build_xml.R Modified to set PFT posteriors and ensemble paths at config time
R/global_sensitivity.R Core functions: generate_sobol_design(), compute_sobol_indices()
R/management_events.R Event building functions for fertilization, with stubs for tillage/planting/harvest
R/crop_lookup.R Joins LandIQ crop data with PEcAn lookup tables for site-specific N rates
analysis/global_sensitivity.qmd Comprehensive Sobol analysis report with visualizations
000-config.yml Added sobol config section and crop lookup paths
data_raw/template.xml Added events input, sensitivity.analysis block; changed qsub from SLURM to SGE
README.md Updated with pipeline structure and Phase 2 script documentation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

PEcAn.logger::logger.severe("'ic_size' and 'met_size' must be >= 1")
}

set.seed(42)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The seed is hard-coded to 42 in the function generate_sobol_design(). This makes all Sobol designs reproducible but prevents users from running multiple independent designs for validation or ensemble purposes. Consider adding a seed parameter with a default value of 42, or documenting this limitation prominently. For sensitivity analysis reproducibility is important, but users should be aware they cannot easily generate independent replicates.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! but for this pipeline reproducibility is the primary concern; we may add a seed parameter in a future iteration if independent replicates are needed

event_config = cfg$event_config %||% list()
)

# write JSON (validated by PEcAn.data.land::validate_events_json)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions validation by PEcAn.data.land::validate_events_json, but the code doesn't actually call this validation function. If validation is important (which it likely is for events), either call the function or update the comment to accurately reflect that validation is not performed here.

Suggested change
# write JSON (validated by PEcAn.data.land::validate_events_json)
# write events JSON for this sample

Copilot uses AI. Check for mistakes.
input_design <- readRDS("cache/input_design.rds")

status_file <- file.path(settings$outdir, "STATUS")
if (!args$continue && file.exists(status_file)) file.remove(status_file)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script removes the status file with file.remove(status_file) when --continue is not specified but doesn't check if the file exists first. While file.remove() returns FALSE silently if the file doesn't exist, this could mask issues. Consider checking file.exists(status_file) before removal, or wrap in a tryCatch to handle potential permission errors more gracefully.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file.exists() check is already present in if condition on the same line

000-config.yml Outdated
Comment on lines +10 to +35
pecan_outdir: "ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
master_design_points: "ccmmf_dir/data/design_points.csv"
pecan_xml_template: "data_raw/template.xml"
sites:
design_points_file: "data_raw/sa_design_points.csv"
n_sample: 10
sensitivity:
sigma_levels: [-2, -1, 1, 2] # [-3, -2, -1, 1, 2, 3] # Standard
# Or use [-2, -1, 1, 2] for faster runs
# ensemble:
# size: 20
# n_met: 10
# run:
# start_date: "2016-01-01"
# end_date: "2023-12-31"
projection:
ca_albers_crs: 3310 No newline at end of file
ca_albers_crs: 3310
# management uncertainty (023)
start_year: 2016
end_year: 2023
site_id: "1000025731"
# crop identity -> N rate mapping (021 + 023)
crop_lookup:
landiq_parquet: "ccmmf_dir/LandIQ-harmonized-v3/crops_all_years.parq"
pft_table_csv: "ccmmf_dir/cadwr-landuse/data/CARB_PFTs_table.csv"
crosswalk_csv: "ccmmf_dir/management/fertilization/crop_type_crosswalk.csv"
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paths under lines 10, 11, and 33-35 contain literal string "ccmmf_dir" instead of using YAML anchors or substitution. These will be interpreted as literal directory names like "ccmmf_dir/modelout/..." rather than substituting the value from line 6. YAML does not perform variable substitution by default. Consider using: (1) YAML anchors and aliases, (2) a templating system, or (3) code-level path resolution using file.path(cfg$paths$ccmmf_dir, "modelout/..."). The current configuration will fail at runtime.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

@divine7022 divine7022 Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! ccmmf_dir/... paths on L10-11 are unused by the global SA pipeline (those are local SA keys);

should be fixed.

Comment on lines +213 to +226
design_points_csv = cfg$sites$design_points_file %||%
"data_raw/design_points_198.csv",
landiq_parquet = cfg$crop_lookup$landiq_parquet %||%
"/projectnb2/dietzelab/ccmmf/LandIQ-harmonized-v3/crops_all_years.parq",
pft_table_csv = cfg$crop_lookup$pft_table_csv %||%
"/projectnb2/dietzelab/abv1/ccmmf/cadwr-landuse/data/CARB_PFTs_table.csv",
crosswalk_csv = cfg$crop_lookup$crosswalk_csv %||%
"/projectnb2/dietzelab/ccmmf/management/fertilization/crop_type_crosswalk.csv",
year = as.integer(cfg$crop_lookup$landiq_year %||% 2023L), # (most recent)
season = as.integer(cfg$crop_lookup$landiq_season %||% 2L)
# NB crop identity is assumed constant across simulation years --
# this is a simplification for annual rotations.
# TODO use per-year LandIQ when rotation data is available
)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script uses hard-coded file paths with ccmmf_dir prefix in the config defaults (lines 216-220). These paths appear to be system-specific (e.g., "/projectnb2/dietzelab/ccmmf/..."). If the ccmmf_dir is not properly set in the config, the script will use these hard-coded paths which may not exist on other systems. Consider either removing the defaults or making them relative paths, and documenting the required config structure.

Copilot uses AI. Check for mistakes.
Comment on lines +195 to +198
PEcAn.logger::logger.severe(sprintf(
"variable %s for runid %s has %d values (expected %d)",
v, rid, length(Y), expected_len
))
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function expects exactly N * (length(params) + 2) values in Y and will raise a severe error otherwise. However, if runs fail or are incomplete, this will cause the entire computation to abort. Consider adding an option to handle incomplete results, perhaps by padding with NA or excluding incomplete sites, especially since the comment in line 98 of 024_run_global_sensitivity.R mentions stop.on.error = FALSE. The current strict validation is inconsistent with allowing runs to fail.

Suggested change
PEcAn.logger::logger.severe(sprintf(
"variable %s for runid %s has %d values (expected %d)",
v, rid, length(Y), expected_len
))
PEcAn.logger::logger.warn(sprintf(
"variable %s for runid %s has %d values (expected %d) -- skipping",
v, rid, length(Y), expected_len
))
next

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sobol SA, all N*(k+2) runs are mathematically required by the Saltelli estimator. Skipping incomplete runs would produce incorrect indices without warning. The strict validation is intentional and consistent with the stop.on.error = FALSE in 024, which allows the MODEL step to continue but ensures we catch failures at the INDEX COMPUTATION step

<delete.raw>FALSE</delete.raw>
<binary>sipnet.git</binary>
<prerun>cp data/events.in @RUNDIR@</prerun>
<binary>sipnet</binary>
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The model binary was changed from "sipnet.git" to "sipnet". Ensure that the PEcAn installation has the "sipnet" binary in the expected location. If "sipnet.git" was a specific build or version identifier, this change might cause runtime failures if the binary is not found. Consider documenting why this change was made.

Suggested change
<binary>sipnet</binary>
<!-- The SIPNET binary name is "sipnet.git" for the git-revision build. -->
<binary>sipnet.git</binary>

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just a sys link, binary can renamed to any

<delete.raw>FALSE</delete.raw>
<binary>sipnet.git</binary>
<prerun>cp data/events.in @RUNDIR@</prerun>
<binary>sipnet</binary>
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prerun command cp data/events.in @RUNDIR@ was removed. This suggests events are now handled differently (via the new events input registration in setEnsemblePaths). However, if any legacy workflows or documentation reference this prerun command, they will need updating. Ensure this change is documented and that the new events mechanism is fully functional before removing the fallback.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removal of <prerun> is intentional and is the core architectural change in this PR. Events are now registered via setEnsemblePaths(), write.configs.SIPNET copies inputs$events$path into each run directory automatically. The old cp prerun was a workaround


# config
cfg <- yaml::read_yaml(args$config)
settings <- PEcAn.settings::read.settings(cfg$settings_xml)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script reads cfg$settings_xml directly on line 37 without checking if it exists or providing a fallback. If this key is missing from the config, the script will fail with an unclear error when trying to read NULL as a file path. Add a validation check similar to 021_generate_sobol_design.R, or provide a default value.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if NULL, read.settings(NULL) will throw a clear error; but noted..

@divine7022
Copy link
Collaborator Author

divine7022 commented Mar 5, 2026

@dlebauer dlebauer requested a review from Copilot March 5, 2026 16:38
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

PEcAn.logger::logger.severe("'ic_size' and 'met_size' must be >= 1")
}

set.seed(42)
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The seed is hard-coded to 42 with no way to override it, making reproducibility invisible to callers and preventing legitimate sensitivity-to-seed studies. Consider adding a seed parameter (e.g., seed = 42L) to generate_sobol_design() so callers can control or document the seed explicitly.

Copilot uses AI. Check for mistakes.
params,
N,
R = 500L) {

Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected output length formula N * (length(params) + 2) matches the Saltelli scheme used in generate_sobol_design(), which produces N * (k + 2) rows (A, B, and k AB matrices). However, params passed here already includes ic_ensemble and met_ensemble (appended in script 025), so length(params) equals k (all inputs including IC/met). This is consistent with design generation. That said, compute_sobol_indices accepts a generic params argument, and if a caller accidentally passes only the continuous params (without IC/met), the length check would silently use the wrong expected length. Consider validating or documenting that params must include IC/met entries to match the design.

Suggested change
# Validate that 'params' matches the Sobol design used to generate outputs.
# It must be the full parameter set, including 'ic_ensemble' and 'met_ensemble'.
if (!is.character(params) || length(params) == 0L) {
PEcAn.logger::logger.severe(
"'params' must be a non-empty character vector of parameter names (including 'ic_ensemble' and 'met_ensemble')."
)
}
required_params <- c("ic_ensemble", "met_ensemble")
if (!all(required_params %in% params)) {
PEcAn.logger::logger.severe(
paste0(
"'params' must include all parameter names used in the Sobol design, ",
"specifically: ", paste(required_params, collapse = ", "),
". Received params: ", paste(params, collapse = ", ")
)
)
}

Copilot uses AI. Check for mistakes.

set -euo pipefail

PROJ_DIR="/projectnb/dietzelab/abv1/ccmmf/uncertainty"
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PROJ_DIR is hard-coded to a personal project directory path. This will not work for other users or CI environments without manual editing. It should be derived dynamically (e.g., via $(pwd) or $(dirname "$(realpath "$0")")/..) or passed as a parameter, consistent with how CONFIG is handled in the other shell scripts.

Suggested change
PROJ_DIR="/projectnb/dietzelab/abv1/ccmmf/uncertainty"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
PROJ_DIR="${PROJ_DIR:-$(cd "${SCRIPT_DIR}/.." && pwd)}"

Copilot uses AI. Check for mistakes.
if (is.null(settings_xml) || !file.exists(settings_xml)) {
PEcAn.logger::logger.severe(
"Settings XML not found: ", settings_xml,
".Check 'settings_xml' in ", args$config
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space before "Check" in the error message.

Suggested change
".Check 'settings_xml' in ", args$config
". Check 'settings_xml' in ", args$config

Copilot uses AI. Check for mistakes.
baseline_events <- if (!is.null(baseline_raw$events)) baseline_raw$events else baseline_raw

# --- load anchor site events (monitoring framework) ---
# site keyed JSON with per type events for 17 anchor sites (defered scaling up).
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"defered" should be "deferred".

Suggested change
# site keyed JSON with per type events for 17 anchor sites (defered scaling up).
# site keyed JSON with per type events for 17 anchor sites (deferred scaling up).

Copilot uses AI. Check for mistakes.
Comment on lines +233 to +235
}

.data <- rlang::.data No newline at end of file
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assigning rlang::.data to a top-level variable .data in a source file is an anti-pattern. It pollutes the global namespace and shadows the .data pronoun used within dplyr verbs. The correct approach is to import it via #' @importFrom rlang .data in a package context, or simply reference it as rlang::.data inline. Since this is a sourced script (not a package), the cleanest fix is to remove this line and use .data directly in the dplyr calls (where it is already recognized) or qualify it as rlang::.data.

Suggested change
}
.data <- rlang::.data
}

Copilot uses AI. Check for mistakes.

# --- config ---
cfg <- yaml::read_yaml(args$config)
settings_xml <- args$settings %||% cfg$settings_xml %||% "data_raw/settings_sa.xml"
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cfg is read with yaml::read_yaml(args$config), which returns a nested list where the settings XML is stored at cfg$default$settings_xml (as seen in 021_generate_sobol_design.R and 022_prepare_pecan_inputs.R). Accessing cfg$settings_xml (top-level) will always return NULL, causing this script to always fall through to the hard-coded default path "data_raw/settings_sa.xml" rather than honoring the config file value.

Suggested change
settings_xml <- args$settings %||% cfg$settings_xml %||% "data_raw/settings_sa.xml"
settings_xml <- args$settings %||% cfg$default$settings_xml %||% cfg$settings_xml %||% "data_raw/settings_sa.xml"

Copilot uses AI. Check for mistakes.
│ └── template.xml
├── scripts/
│ ├── 001_setup_design_points.R
| ├── 002_build_xml.R
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The directory tree uses | (pipe) as the leading character on lines 68, 72, and 73, inconsistent with the (box-drawing character) used on all other tree lines. This causes the tree to render incorrectly.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| ├── 002_build_xml.R
├── 002_build_xml.R

Copy link
Contributor

@dlebauer dlebauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent first pass at the analysis. This is the first half of my review; I have reviewed the analysis and R/*R code, and a few of the scripts. I will complete my review of the scripts/*R, tools, and README.md later.

In slack, @mdietze suggested using the existing PEcAn input type level global SA. That makes sense, but can be deferred and does not need to be addressed in this PR.

Requested changes are focused on correctness, clarity, and internal consistency.

Please defer anything that is not essential for MVP and correctness (including requested changes) by creating one or more follow up issues and/or a TODO file.

Changes requested:

  • Rename 'biological parameters' --> 'model parameters' or simply 'parameters' throughout. SIPNET parameters are not all strictly 'biological'. Also update other uses of the biological parameters concept, including bio_frac and interpretation.
  • Make sure that management uncertainty is included in analyses (results, interpretation, captions). It is present in analysis code but omitted/obscured in descriptions.
  • Make it clear when results refer to 'first order (main-effect) variance'. Several places read as if Si or sum(Si) represents total variance.
  • Disambiguate type-level vs per-input analysis; add clarification to README.

Additional requested fixes

  • Use "input" where appropriate rather than overloading "parameter". In several places ranked objects include drivers and mgmt inputs not just model parameters

Future work:

  • A separate type-level analysis using existing PEcAn Sobol functionality. This can be positioned upstream of this more detailed analysis.
  • coordinate with @ashiklom on functions in R/management_events.R.

Based on a Sobol analysis with `r N` base samples and `r total_runs` total runs across `r n_sites` sites:

- **Top parameters**: `r top3$Label[1]`, `r top3$Label[2]`, and `r top3$Label[3]` explain the most output variance ($T_i$ = `r round(top3$median_Ti[1], 2)`, `r round(top3$median_Ti[2], 2)`, `r round(top3$median_Ti[3], 2)`)
- **Biological parameters** account for \~`r round(bio_frac * 100)`% of explained variance; meteorological and initial condition drivers explain the rest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **Biological parameters** account for \~`r round(bio_frac * 100)`% of explained variance; meteorological and initial condition drivers explain the rest
- **Model parameters** account for ~`r round(bio_frac * 100)`% of the first-order variance contribution; the remaining contribution is distributed across meteorological drivers, initial conditions, and management inputs.

Code implies management should be included.


# Variance Decomposition (Source Partitioning) {#sec-variance-partition}

Before ranking individual parameters, we assess the *sources* of uncertainty. We sum the First-Order indices ($S_i$) by category (see @eq-si). The "Control (Noise)" segment is excluded to avoid confusion; it represents numerical estimation error from the Sobol design.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Before ranking individual parameters, we assess the *sources* of uncertainty. We sum the First-Order indices ($S_i$) by category (see @eq-si). The "Control (Noise)" segment is excluded to avoid confusion; it represents numerical estimation error from the Sobol design.
This section summarizes first-order (main-effect) uncertainty contributions by source (model parameters, meteorology, initial conditions, and management). This is done by summing first-order Sobol indices ($S_i$) within broad categories (@eq-si). These contributions do not include interaction effects (see Ti - Si).
For discrete inputs such as initial condition and meteorological ensembles, Sobol coordinates are used to select among ensemble members / files. The resulting indices quantify variance attributable to uncertainty in which ensemble member is selected, not sensitivity along an ordered or continuous variable.
The "Control (Noise)" source is excluded for clarity; it accounts for numerical estimation error from the Sobol design.


**Gap** ($T_i - S_i$): Higher-order interaction effects with other parameters

Parameters are ordered by median $T_i$ across sites. We show the top 15 per output variable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

top 15 what?

Suggested change
Parameters are ordered by median $T_i$ across sites. We show the top 15 per output variable.
Parameters are ordered by median $T_i$ across sites. We show the top 15 inputs for each output variable.


# Parameter Ranking (First & Total Order) {#sec-rankings}

This plot compares First-Order ($S_i$, solid bar) and Total-Order ($T_i$, whisker) indices. The gap between them quantifies interaction strength (@eq-ti minus @eq-si).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This plot compares First-Order ($S_i$, solid bar) and Total-Order ($T_i$, whisker) indices. The gap between them quantifies interaction strength (@eq-ti minus @eq-si).
This plot compares First-Order ($S_i$, solid bar) and Total-Order ($T_i$, whisker) indices. The gap between them quantifies interaction strength (@eq-ti minus @eq-si).
This ranking identifies the inputs with the largest total effect (Ti), indicating which uncertainties most strongly influence model outputs.


```{r}
#| label: fig-variance-partition
#| fig-cap: "Proportion of variance explained by Biology (parameters) vs. Environment (IC / met). This partitioning answers: is output uncertainty dominated by what we know about the plants, or by the weather and initial conditions?"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indicate that this is proportion of first order variance

Suggested change
#| fig-cap: "Proportion of variance explained by Biology (parameters) vs. Environment (IC / met). This partitioning answers: is output uncertainty dominated by what we know about the plants, or by the weather and initial conditions?"
#| fig-cap: "First-order (main-effect) variance contribution by source group (model parameters, meteorology, initial conditions, management). Bars are based on summed Si values and do not include interaction effects."


# Model Additivity {#sec-additivity}

The sum of First-Order indices ($\sum S_i$) indicates the degree of model linearity. Values near 1.0 imply the model is **additive** (parameters contribute independently). Values well below 1 indicate that **interactions dominate** -- the model's response to one parameter depends on the values of others.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The sum of First-Order indices ($\sum S_i$) indicates the degree of model linearity. Values near 1.0 imply the model is **additive** (parameters contribute independently). Values well below 1 indicate that **interactions dominate** -- the model's response to one parameter depends on the values of others.
The sum of First-Order indices ($\sum S_i$) indicates the degree of model linearity. Values near 1.0 imply the model is **additive** and parameters contribute independently. Values well below 1 indicate that **interactions dominate** -- the model's response to one parameter depends on the values of others.

filter(Category != "Control (Noise)") |>
summarize(Total_Si = sum(Si, na.rm = TRUE), .by = c("variable", "Category")) |>
mutate(frac = Total_Si / sum(Total_Si), .by = "variable") |>
filter(Category == "Biological Parameter") |>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
filter(Category == "Biological Parameter") |>
filter(Category == "Model Parameter") |>

#' @param events List of event lists from baseline.
#' @param years Integer vector of simulation years.
#' @return Filtered/replicated event lists.
filter_events_to_years <- function(events, years) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now / MVP to get this working. But this should happen somewhere further up the pipeline, e.g. in IC_prep (?), not specific to uncertainty analysis.

And, in production, we should not silently repeat an event for unique(event_years)) unless the behavior is well documented. And again, not as a workaround within the uncertainty pipeline.

Suggested change
filter_events_to_years <- function(events, years) {
filter_events_to_years <- function(events, years) {
# TODO This is a stopgap for now; should be moved further up pipeline
# And ensure silent recycling is desired behavior

Comment on lines +100 to +105
PEcAn.logger::logger.warn(
"Unknown distribution '", prior$distn, "' for '", param_name,
"'. Falling back to uniform."
)
paramb <- if (is.null(prior$paramb)) prior$parama + 1 else prior$paramb
stats::qunif(x, min = prior$parama, max = paramb)
Copy link
Contributor

@dlebauer dlebauer Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good place to error instead of warn. if (is.null(prior$paramb)) prior$parama + 1 else prior$paramb seems especially dangerous because it isn't clear if parama is a minimum or central tendency, or whether parama + 1 represents a sensible spread. Thus, the uniform fallback could easily generate an unreasonable prior distribution.

settings_xml = settings_xml,
timestamp = Sys.time()
)
saveRDS(metadata, output_meta)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to save as rds instead of json?
We've been moving away from storing in R-specific binary formats; as text for smaller files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants