Implement variance-based global sensitivity Analysis (Sobol) pipeline#2
Implement variance-based global sensitivity Analysis (Sobol) pipeline#2divine7022 wants to merge 41 commits intoccmmf:mainfrom
Conversation
R/global_sensitivity.R
Outdated
| #' @param ic_range Integer vector of available IC ensemble IDs. Should match | ||
| #' the number of IC files in settings XML (e.g. 1:20, 1:100). | ||
| #' @param met_range Integer vector of available met ensemble IDs. Should match | ||
| #' the number of met files in settings XML (e.g. 1:10). |
There was a problem hiding this comment.
should ic_range and met_range be required to be a sequential integer starting from 1? If so, would it be sufficient to have ic_size and met_size representing the size of the ensemble?
There was a problem hiding this comment.
Pull request overview
This PR implements a comprehensive multisite variance-based global sensitivity analysis (GSA) pipeline using Sobol indices for the SIPNET ecosystem model. The pipeline integrates Saltelli sampling for design generation, PEcAn workflow orchestration, management event generation with crop-specific fertilization rates, and variance decomposition analysis with visualization.
Changes:
- Added complete Sobol global sensitivity workflow (scripts 021-025) that generates quasi-random designs including parameters and environmental drivers (IC, met), converts to PEcAn format, executes model runs, and computes first/total-order Sobol indices
- Implemented management event generation system (script 023, R/management_events.R, R/crop_lookup.R) that maps quantile-based samples to crop-specific N fertilizer and compost rates using California agricultural databases
- Created Quarto analysis report (analysis/global_sensitivity.qmd) with variance decomposition visualizations, parameter ranking, interaction analysis, and factor screening using dummy parameter baseline
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 24 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/sge_array_launcher.sh | SGE job array launcher for parallel model runs |
| scripts/run_pipeline.sh | Master orchestration script for local SA → global SA → variance decomposition |
| scripts/run_local_sa.sh | Local sensitivity (OAT) pipeline wrapper |
| scripts/run_global_sa.sh | Global sensitivity (Sobol) pipeline wrapper calling 021-025 |
| scripts/021_generate_sobol_design.R | Generates Saltelli design matrices with params, IC, met using sensobol |
| scripts/022_prepare_pecan_inputs.R | Converts Sobol design to PEcAn samples.Rdata format |
| scripts/023_generate_management_events.R | Maps quantiles to crop-specific management events (N, compost) |
| scripts/024_run_global_sensitivity.R | PEcAn workflow execution with per-sample events registration |
| scripts/025_compute_sobol_indices.R | Computes bootstrapped Sobol indices from ensemble outputs |
| scripts/002_build_xml.R | Modified to set PFT posteriors and ensemble paths at config time |
| R/global_sensitivity.R | Core functions: generate_sobol_design(), compute_sobol_indices() |
| R/management_events.R | Event building functions for fertilization, with stubs for tillage/planting/harvest |
| R/crop_lookup.R | Joins LandIQ crop data with PEcAn lookup tables for site-specific N rates |
| analysis/global_sensitivity.qmd | Comprehensive Sobol analysis report with visualizations |
| 000-config.yml | Added sobol config section and crop lookup paths |
| data_raw/template.xml | Added events input, sensitivity.analysis block; changed qsub from SLURM to SGE |
| README.md | Updated with pipeline structure and Phase 2 script documentation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| PEcAn.logger::logger.severe("'ic_size' and 'met_size' must be >= 1") | ||
| } | ||
|
|
||
| set.seed(42) |
There was a problem hiding this comment.
The seed is hard-coded to 42 in the function generate_sobol_design(). This makes all Sobol designs reproducible but prevents users from running multiple independent designs for validation or ensemble purposes. Consider adding a seed parameter with a default value of 42, or documenting this limitation prominently. For sensitivity analysis reproducibility is important, but users should be aware they cannot easily generate independent replicates.
There was a problem hiding this comment.
good point! but for this pipeline reproducibility is the primary concern; we may add a seed parameter in a future iteration if independent replicates are needed
| event_config = cfg$event_config %||% list() | ||
| ) | ||
|
|
||
| # write JSON (validated by PEcAn.data.land::validate_events_json) |
There was a problem hiding this comment.
The comment mentions validation by PEcAn.data.land::validate_events_json, but the code doesn't actually call this validation function. If validation is important (which it likely is for events), either call the function or update the comment to accurately reflect that validation is not performed here.
| # write JSON (validated by PEcAn.data.land::validate_events_json) | |
| # write events JSON for this sample |
| input_design <- readRDS("cache/input_design.rds") | ||
|
|
||
| status_file <- file.path(settings$outdir, "STATUS") | ||
| if (!args$continue && file.exists(status_file)) file.remove(status_file) |
There was a problem hiding this comment.
The script removes the status file with file.remove(status_file) when --continue is not specified but doesn't check if the file exists first. While file.remove() returns FALSE silently if the file doesn't exist, this could mask issues. Consider checking file.exists(status_file) before removal, or wrap in a tryCatch to handle potential permission errors more gracefully.
There was a problem hiding this comment.
file.exists() check is already present in if condition on the same line
000-config.yml
Outdated
| pecan_outdir: "ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701" | ||
| master_design_points: "ccmmf_dir/data/design_points.csv" | ||
| pecan_xml_template: "data_raw/template.xml" | ||
| sites: | ||
| design_points_file: "data_raw/sa_design_points.csv" | ||
| n_sample: 10 | ||
| sensitivity: | ||
| sigma_levels: [-2, -1, 1, 2] # [-3, -2, -1, 1, 2, 3] # Standard | ||
| # Or use [-2, -1, 1, 2] for faster runs | ||
| # ensemble: | ||
| # size: 20 | ||
| # n_met: 10 | ||
| # run: | ||
| # start_date: "2016-01-01" | ||
| # end_date: "2023-12-31" | ||
| projection: | ||
| ca_albers_crs: 3310 No newline at end of file | ||
| ca_albers_crs: 3310 | ||
| # management uncertainty (023) | ||
| start_year: 2016 | ||
| end_year: 2023 | ||
| site_id: "1000025731" | ||
| # crop identity -> N rate mapping (021 + 023) | ||
| crop_lookup: | ||
| landiq_parquet: "ccmmf_dir/LandIQ-harmonized-v3/crops_all_years.parq" | ||
| pft_table_csv: "ccmmf_dir/cadwr-landuse/data/CARB_PFTs_table.csv" | ||
| crosswalk_csv: "ccmmf_dir/management/fertilization/crop_type_crosswalk.csv" |
There was a problem hiding this comment.
The paths under lines 10, 11, and 33-35 contain literal string "ccmmf_dir" instead of using YAML anchors or substitution. These will be interpreted as literal directory names like "ccmmf_dir/modelout/..." rather than substituting the value from line 6. YAML does not perform variable substitution by default. Consider using: (1) YAML anchors and aliases, (2) a templating system, or (3) code-level path resolution using file.path(cfg$paths$ccmmf_dir, "modelout/..."). The current configuration will fail at runtime.
There was a problem hiding this comment.
good catch! ccmmf_dir/... paths on L10-11 are unused by the global SA pipeline (those are local SA keys);
should be fixed.
scripts/021_generate_sobol_design.R
Outdated
| design_points_csv = cfg$sites$design_points_file %||% | ||
| "data_raw/design_points_198.csv", | ||
| landiq_parquet = cfg$crop_lookup$landiq_parquet %||% | ||
| "/projectnb2/dietzelab/ccmmf/LandIQ-harmonized-v3/crops_all_years.parq", | ||
| pft_table_csv = cfg$crop_lookup$pft_table_csv %||% | ||
| "/projectnb2/dietzelab/abv1/ccmmf/cadwr-landuse/data/CARB_PFTs_table.csv", | ||
| crosswalk_csv = cfg$crop_lookup$crosswalk_csv %||% | ||
| "/projectnb2/dietzelab/ccmmf/management/fertilization/crop_type_crosswalk.csv", | ||
| year = as.integer(cfg$crop_lookup$landiq_year %||% 2023L), # (most recent) | ||
| season = as.integer(cfg$crop_lookup$landiq_season %||% 2L) | ||
| # NB crop identity is assumed constant across simulation years -- | ||
| # this is a simplification for annual rotations. | ||
| # TODO use per-year LandIQ when rotation data is available | ||
| ) |
There was a problem hiding this comment.
The script uses hard-coded file paths with ccmmf_dir prefix in the config defaults (lines 216-220). These paths appear to be system-specific (e.g., "/projectnb2/dietzelab/ccmmf/..."). If the ccmmf_dir is not properly set in the config, the script will use these hard-coded paths which may not exist on other systems. Consider either removing the defaults or making them relative paths, and documenting the required config structure.
| PEcAn.logger::logger.severe(sprintf( | ||
| "variable %s for runid %s has %d values (expected %d)", | ||
| v, rid, length(Y), expected_len | ||
| )) |
There was a problem hiding this comment.
The function expects exactly N * (length(params) + 2) values in Y and will raise a severe error otherwise. However, if runs fail or are incomplete, this will cause the entire computation to abort. Consider adding an option to handle incomplete results, perhaps by padding with NA or excluding incomplete sites, especially since the comment in line 98 of 024_run_global_sensitivity.R mentions stop.on.error = FALSE. The current strict validation is inconsistent with allowing runs to fail.
| PEcAn.logger::logger.severe(sprintf( | |
| "variable %s for runid %s has %d values (expected %d)", | |
| v, rid, length(Y), expected_len | |
| )) | |
| PEcAn.logger::logger.warn(sprintf( | |
| "variable %s for runid %s has %d values (expected %d) -- skipping", | |
| v, rid, length(Y), expected_len | |
| )) | |
| next |
There was a problem hiding this comment.
for sobol SA, all N*(k+2) runs are mathematically required by the Saltelli estimator. Skipping incomplete runs would produce incorrect indices without warning. The strict validation is intentional and consistent with the stop.on.error = FALSE in 024, which allows the MODEL step to continue but ensures we catch failures at the INDEX COMPUTATION step
| <delete.raw>FALSE</delete.raw> | ||
| <binary>sipnet.git</binary> | ||
| <prerun>cp data/events.in @RUNDIR@</prerun> | ||
| <binary>sipnet</binary> |
There was a problem hiding this comment.
The model binary was changed from "sipnet.git" to "sipnet". Ensure that the PEcAn installation has the "sipnet" binary in the expected location. If "sipnet.git" was a specific build or version identifier, this change might cause runtime failures if the binary is not found. Consider documenting why this change was made.
| <binary>sipnet</binary> | |
| <!-- The SIPNET binary name is "sipnet.git" for the git-revision build. --> | |
| <binary>sipnet.git</binary> |
There was a problem hiding this comment.
it's just a sys link, binary can renamed to any
| <delete.raw>FALSE</delete.raw> | ||
| <binary>sipnet.git</binary> | ||
| <prerun>cp data/events.in @RUNDIR@</prerun> | ||
| <binary>sipnet</binary> |
There was a problem hiding this comment.
The prerun command cp data/events.in @RUNDIR@ was removed. This suggests events are now handled differently (via the new events input registration in setEnsemblePaths). However, if any legacy workflows or documentation reference this prerun command, they will need updating. Ensure this change is documented and that the new events mechanism is fully functional before removing the fallback.
There was a problem hiding this comment.
removal of <prerun> is intentional and is the core architectural change in this PR. Events are now registered via setEnsemblePaths(), write.configs.SIPNET copies inputs$events$path into each run directory automatically. The old cp prerun was a workaround
scripts/022_prepare_pecan_inputs.R
Outdated
|
|
||
| # config | ||
| cfg <- yaml::read_yaml(args$config) | ||
| settings <- PEcAn.settings::read.settings(cfg$settings_xml) |
There was a problem hiding this comment.
The script reads cfg$settings_xml directly on line 37 without checking if it exists or providing a fallback. If this key is missing from the config, the script will fail with an unclear error when trying to read NULL as a file path. Add a validation check similar to 021_generate_sobol_design.R, or provide a default value.
There was a problem hiding this comment.
if NULL, read.settings(NULL) will throw a clear error; but noted..
…sing points from geo
|
global SA report of mvp runs -- https://drive.google.com/file/d/15NEWLViccNnGDx4ooy33Dq-NrCKdZ3j8/view?usp=sharing |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| PEcAn.logger::logger.severe("'ic_size' and 'met_size' must be >= 1") | ||
| } | ||
|
|
||
| set.seed(42) |
There was a problem hiding this comment.
The seed is hard-coded to 42 with no way to override it, making reproducibility invisible to callers and preventing legitimate sensitivity-to-seed studies. Consider adding a seed parameter (e.g., seed = 42L) to generate_sobol_design() so callers can control or document the seed explicitly.
| params, | ||
| N, | ||
| R = 500L) { | ||
|
|
There was a problem hiding this comment.
The expected output length formula N * (length(params) + 2) matches the Saltelli scheme used in generate_sobol_design(), which produces N * (k + 2) rows (A, B, and k AB matrices). However, params passed here already includes ic_ensemble and met_ensemble (appended in script 025), so length(params) equals k (all inputs including IC/met). This is consistent with design generation. That said, compute_sobol_indices accepts a generic params argument, and if a caller accidentally passes only the continuous params (without IC/met), the length check would silently use the wrong expected length. Consider validating or documenting that params must include IC/met entries to match the design.
| # Validate that 'params' matches the Sobol design used to generate outputs. | |
| # It must be the full parameter set, including 'ic_ensemble' and 'met_ensemble'. | |
| if (!is.character(params) || length(params) == 0L) { | |
| PEcAn.logger::logger.severe( | |
| "'params' must be a non-empty character vector of parameter names (including 'ic_ensemble' and 'met_ensemble')." | |
| ) | |
| } | |
| required_params <- c("ic_ensemble", "met_ensemble") | |
| if (!all(required_params %in% params)) { | |
| PEcAn.logger::logger.severe( | |
| paste0( | |
| "'params' must include all parameter names used in the Sobol design, ", | |
| "specifically: ", paste(required_params, collapse = ", "), | |
| ". Received params: ", paste(params, collapse = ", ") | |
| ) | |
| ) | |
| } |
|
|
||
| set -euo pipefail | ||
|
|
||
| PROJ_DIR="/projectnb/dietzelab/abv1/ccmmf/uncertainty" |
There was a problem hiding this comment.
PROJ_DIR is hard-coded to a personal project directory path. This will not work for other users or CI environments without manual editing. It should be derived dynamically (e.g., via $(pwd) or $(dirname "$(realpath "$0")")/..) or passed as a parameter, consistent with how CONFIG is handled in the other shell scripts.
| PROJ_DIR="/projectnb/dietzelab/abv1/ccmmf/uncertainty" | |
| SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)" | |
| PROJ_DIR="${PROJ_DIR:-$(cd "${SCRIPT_DIR}/.." && pwd)}" |
| if (is.null(settings_xml) || !file.exists(settings_xml)) { | ||
| PEcAn.logger::logger.severe( | ||
| "Settings XML not found: ", settings_xml, | ||
| ".Check 'settings_xml' in ", args$config |
There was a problem hiding this comment.
Missing space before "Check" in the error message.
| ".Check 'settings_xml' in ", args$config | |
| ". Check 'settings_xml' in ", args$config |
| baseline_events <- if (!is.null(baseline_raw$events)) baseline_raw$events else baseline_raw | ||
|
|
||
| # --- load anchor site events (monitoring framework) --- | ||
| # site keyed JSON with per type events for 17 anchor sites (defered scaling up). |
There was a problem hiding this comment.
"defered" should be "deferred".
| # site keyed JSON with per type events for 17 anchor sites (defered scaling up). | |
| # site keyed JSON with per type events for 17 anchor sites (deferred scaling up). |
| } | ||
|
|
||
| .data <- rlang::.data No newline at end of file |
There was a problem hiding this comment.
Assigning rlang::.data to a top-level variable .data in a source file is an anti-pattern. It pollutes the global namespace and shadows the .data pronoun used within dplyr verbs. The correct approach is to import it via #' @importFrom rlang .data in a package context, or simply reference it as rlang::.data inline. Since this is a sourced script (not a package), the cleanest fix is to remove this line and use .data directly in the dplyr calls (where it is already recognized) or qualify it as rlang::.data.
| } | |
| .data <- rlang::.data | |
| } |
|
|
||
| # --- config --- | ||
| cfg <- yaml::read_yaml(args$config) | ||
| settings_xml <- args$settings %||% cfg$settings_xml %||% "data_raw/settings_sa.xml" |
There was a problem hiding this comment.
cfg is read with yaml::read_yaml(args$config), which returns a nested list where the settings XML is stored at cfg$default$settings_xml (as seen in 021_generate_sobol_design.R and 022_prepare_pecan_inputs.R). Accessing cfg$settings_xml (top-level) will always return NULL, causing this script to always fall through to the hard-coded default path "data_raw/settings_sa.xml" rather than honoring the config file value.
| settings_xml <- args$settings %||% cfg$settings_xml %||% "data_raw/settings_sa.xml" | |
| settings_xml <- args$settings %||% cfg$default$settings_xml %||% cfg$settings_xml %||% "data_raw/settings_sa.xml" |
| │ └── template.xml | ||
| ├── scripts/ | ||
| │ ├── 001_setup_design_points.R | ||
| | ├── 002_build_xml.R |
There was a problem hiding this comment.
The directory tree uses | (pipe) as the leading character on lines 68, 72, and 73, inconsistent with the │ (box-drawing character) used on all other tree lines. This causes the tree to render incorrectly.
There was a problem hiding this comment.
| | ├── 002_build_xml.R | |
| │ ├── 002_build_xml.R |
dlebauer
left a comment
There was a problem hiding this comment.
This is an excellent first pass at the analysis. This is the first half of my review; I have reviewed the analysis and R/*R code, and a few of the scripts. I will complete my review of the scripts/*R, tools, and README.md later.
In slack, @mdietze suggested using the existing PEcAn input type level global SA. That makes sense, but can be deferred and does not need to be addressed in this PR.
Requested changes are focused on correctness, clarity, and internal consistency.
Please defer anything that is not essential for MVP and correctness (including requested changes) by creating one or more follow up issues and/or a TODO file.
Changes requested:
- Rename 'biological parameters' --> 'model parameters' or simply 'parameters' throughout. SIPNET parameters are not all strictly 'biological'. Also update other uses of the biological parameters concept, including
bio_fracand interpretation. - Make sure that management uncertainty is included in analyses (results, interpretation, captions). It is present in analysis code but omitted/obscured in descriptions.
- Make it clear when results refer to 'first order (main-effect) variance'. Several places read as if Si or sum(Si) represents total variance.
- Disambiguate type-level vs per-input analysis; add clarification to README.
Additional requested fixes
- Use "input" where appropriate rather than overloading "parameter". In several places ranked objects include drivers and mgmt inputs not just model parameters
Future work:
- A separate type-level analysis using existing PEcAn Sobol functionality. This can be positioned upstream of this more detailed analysis.
- coordinate with @ashiklom on functions in R/management_events.R.
| Based on a Sobol analysis with `r N` base samples and `r total_runs` total runs across `r n_sites` sites: | ||
|
|
||
| - **Top parameters**: `r top3$Label[1]`, `r top3$Label[2]`, and `r top3$Label[3]` explain the most output variance ($T_i$ = `r round(top3$median_Ti[1], 2)`, `r round(top3$median_Ti[2], 2)`, `r round(top3$median_Ti[3], 2)`) | ||
| - **Biological parameters** account for \~`r round(bio_frac * 100)`% of explained variance; meteorological and initial condition drivers explain the rest |
There was a problem hiding this comment.
| - **Biological parameters** account for \~`r round(bio_frac * 100)`% of explained variance; meteorological and initial condition drivers explain the rest | |
| - **Model parameters** account for ~`r round(bio_frac * 100)`% of the first-order variance contribution; the remaining contribution is distributed across meteorological drivers, initial conditions, and management inputs. |
Code implies management should be included.
|
|
||
| # Variance Decomposition (Source Partitioning) {#sec-variance-partition} | ||
|
|
||
| Before ranking individual parameters, we assess the *sources* of uncertainty. We sum the First-Order indices ($S_i$) by category (see @eq-si). The "Control (Noise)" segment is excluded to avoid confusion; it represents numerical estimation error from the Sobol design. |
There was a problem hiding this comment.
| Before ranking individual parameters, we assess the *sources* of uncertainty. We sum the First-Order indices ($S_i$) by category (see @eq-si). The "Control (Noise)" segment is excluded to avoid confusion; it represents numerical estimation error from the Sobol design. | |
| This section summarizes first-order (main-effect) uncertainty contributions by source (model parameters, meteorology, initial conditions, and management). This is done by summing first-order Sobol indices ($S_i$) within broad categories (@eq-si). These contributions do not include interaction effects (see Ti - Si). | |
| For discrete inputs such as initial condition and meteorological ensembles, Sobol coordinates are used to select among ensemble members / files. The resulting indices quantify variance attributable to uncertainty in which ensemble member is selected, not sensitivity along an ordered or continuous variable. | |
| The "Control (Noise)" source is excluded for clarity; it accounts for numerical estimation error from the Sobol design. | |
|
|
||
| **Gap** ($T_i - S_i$): Higher-order interaction effects with other parameters | ||
|
|
||
| Parameters are ordered by median $T_i$ across sites. We show the top 15 per output variable. |
There was a problem hiding this comment.
top 15 what?
| Parameters are ordered by median $T_i$ across sites. We show the top 15 per output variable. | |
| Parameters are ordered by median $T_i$ across sites. We show the top 15 inputs for each output variable. |
|
|
||
| # Parameter Ranking (First & Total Order) {#sec-rankings} | ||
|
|
||
| This plot compares First-Order ($S_i$, solid bar) and Total-Order ($T_i$, whisker) indices. The gap between them quantifies interaction strength (@eq-ti minus @eq-si). |
There was a problem hiding this comment.
| This plot compares First-Order ($S_i$, solid bar) and Total-Order ($T_i$, whisker) indices. The gap between them quantifies interaction strength (@eq-ti minus @eq-si). | |
| This plot compares First-Order ($S_i$, solid bar) and Total-Order ($T_i$, whisker) indices. The gap between them quantifies interaction strength (@eq-ti minus @eq-si). | |
| This ranking identifies the inputs with the largest total effect (Ti), indicating which uncertainties most strongly influence model outputs. | |
|
|
||
| ```{r} | ||
| #| label: fig-variance-partition | ||
| #| fig-cap: "Proportion of variance explained by Biology (parameters) vs. Environment (IC / met). This partitioning answers: is output uncertainty dominated by what we know about the plants, or by the weather and initial conditions?" |
There was a problem hiding this comment.
indicate that this is proportion of first order variance
| #| fig-cap: "Proportion of variance explained by Biology (parameters) vs. Environment (IC / met). This partitioning answers: is output uncertainty dominated by what we know about the plants, or by the weather and initial conditions?" | |
| #| fig-cap: "First-order (main-effect) variance contribution by source group (model parameters, meteorology, initial conditions, management). Bars are based on summed Si values and do not include interaction effects." |
|
|
||
| # Model Additivity {#sec-additivity} | ||
|
|
||
| The sum of First-Order indices ($\sum S_i$) indicates the degree of model linearity. Values near 1.0 imply the model is **additive** (parameters contribute independently). Values well below 1 indicate that **interactions dominate** -- the model's response to one parameter depends on the values of others. |
There was a problem hiding this comment.
| The sum of First-Order indices ($\sum S_i$) indicates the degree of model linearity. Values near 1.0 imply the model is **additive** (parameters contribute independently). Values well below 1 indicate that **interactions dominate** -- the model's response to one parameter depends on the values of others. | |
| The sum of First-Order indices ($\sum S_i$) indicates the degree of model linearity. Values near 1.0 imply the model is **additive** and parameters contribute independently. Values well below 1 indicate that **interactions dominate** -- the model's response to one parameter depends on the values of others. |
| filter(Category != "Control (Noise)") |> | ||
| summarize(Total_Si = sum(Si, na.rm = TRUE), .by = c("variable", "Category")) |> | ||
| mutate(frac = Total_Si / sum(Total_Si), .by = "variable") |> | ||
| filter(Category == "Biological Parameter") |> |
There was a problem hiding this comment.
| filter(Category == "Biological Parameter") |> | |
| filter(Category == "Model Parameter") |> |
| #' @param events List of event lists from baseline. | ||
| #' @param years Integer vector of simulation years. | ||
| #' @return Filtered/replicated event lists. | ||
| filter_events_to_years <- function(events, years) { |
There was a problem hiding this comment.
This is fine for now / MVP to get this working. But this should happen somewhere further up the pipeline, e.g. in IC_prep (?), not specific to uncertainty analysis.
And, in production, we should not silently repeat an event for unique(event_years)) unless the behavior is well documented. And again, not as a workaround within the uncertainty pipeline.
| filter_events_to_years <- function(events, years) { | |
| filter_events_to_years <- function(events, years) { | |
| # TODO This is a stopgap for now; should be moved further up pipeline | |
| # And ensure silent recycling is desired behavior |
| PEcAn.logger::logger.warn( | ||
| "Unknown distribution '", prior$distn, "' for '", param_name, | ||
| "'. Falling back to uniform." | ||
| ) | ||
| paramb <- if (is.null(prior$paramb)) prior$parama + 1 else prior$paramb | ||
| stats::qunif(x, min = prior$parama, max = paramb) |
There was a problem hiding this comment.
This seems like a good place to error instead of warn. if (is.null(prior$paramb)) prior$parama + 1 else prior$paramb seems especially dangerous because it isn't clear if parama is a minimum or central tendency, or whether parama + 1 represents a sensible spread. Thus, the uniform fallback could easily generate an unreasonable prior distribution.
| settings_xml = settings_xml, | ||
| timestamp = Sys.time() | ||
| ) | ||
| saveRDS(metadata, output_meta) |
There was a problem hiding this comment.
is there a reason to save as rds instead of json?
We've been moving away from storing in R-specific binary formats; as text for smaller files.
Summary
Add a complete multisite variance-based global sensitivity analysis (GSA) pipeline based on Saltelli sampling and Sobol indices (sensobol). This PR implements generation of sobol designs that include parameters and drivers (IC, met) using the discrete-mapping approach recommended in the sensobol documentation, converts designs to pecan inputs, launches model runs, computes sobol indices, and adds a quarto report to visualize and summarize results.
This work implements the requirements of issue #152 (Multisite global (sobol) sensitivity) and lays the groundwork for Issue #153 (Variance decomposition / uncertainty partitioning).
scripts/021_generate_sobol_design.R--> orchestrates sobol design creation and saves design + metadata. Adds thedummyparameter for noise baselining.R/global_sensitivity.R--> core functiongenerate_sobol_design()(saltelli matrices via sensobol) with:ic_ensembleandmet_ensembleas sampled inputs,floor(qunif(..., min, max+1))per sensobol examples.scripts/022_prepare_pecan_inputs.R--> converts sobol design to pecantrait.samples/input_designand savessamples.Rdata+cache/input_design.rds.scripts/023_run_global_sensitivity.R--> run wrapper that readsinput_designand runs pecan workflow (note: script forcessettings$ensemble$size <- 1to ensure single-run per design row).scripts/024_compute_sobol_indices.R--> loads ensemble outputs, reconstructs Y, computes sobol indices withsensobol::sobol_indices()(bootstrapped), and savesdata/sobol_indices.csv.analysis/global_sensitivity.qmd--> quarto analysis/report to visualize first-order and total-order indices, interactions, variance partitioning (Biology vs Environment), fixable parameters using dummy/noise, and additivity checks.analysis/(artifact files) --> report.qmd/.htmland supporting files.