Implement multi-site local sensitivity analysis workflow #1

divine7022 wants to merge 28 commits into ccmmf:main from
Conversation
BTW this looks really nice, well done. Though I'd like to work through the steps as I review. Could you please add a readme with instructions on how to use the repository?

thanks for reviewing, updated readme in #2
infotroph
left a comment
Just realized I left these comments unposted on my partial readthrough the other day, so adding now. Would it be useful for me to finish a full review now or wait for updates first?
```r
#----------------------------------
# read.settings() uses xmlToList, which loses duplicate <variable> tags, so read the XML directly
library(XML)
xml_doc <- XML::xmlParse(file.path("output", "pecan.CONFIGS.xml"))
sa_variables <- unique(XML::xpathSApply(
  xml_doc,
  "//sensitivity.analysis//variable",
  XML::xmlValue
))
XML::free(xml_doc)
```
Suggested change:

```r
sa_variables <- settings$sensitivity.analysis |>
  (\(x) x[names(x) == "variable"])() |>
  unlist() |>
  unique()
```
```r
@@ -0,0 +1,108 @@
#!/usr/bin/env Rscript
```
Does this site selection process differ from what David already implemented in the downscaling repo? Seems preferable to use an existing tool rather than add that complexity to the UA process.
I have refactored this script to completely remove the clustering logic. Instead it now acts purely as a consumer of the shared design_points.csv artifact (currently David's manually selected 198 sites), performing the necessary PFT mapping and sub-sampling for UA.
We will treat this as the interface for now, and can revisit or refactor the integration logic once the broader coordination strategy between the Downscaling and Uncertainty workflows is finalized.
see comment in #1
```r
@@ -0,0 +1,497 @@
#' Aggregate local sensitivity results across sites
```
I'm not sure what "local" means here
good catch regarding the ambiguity
In this context, 'local' refers to the mathematical method used by PEcAn's variance decomposition: a one-at-a-time (OAT) SA in which parameters are varied individually around their medians to calculate partial derivatives (elasticities). This is distinct from the global sensitivity analysis we perform later.
I have updated the function documentation to explicitly state this aggregates one-at-a-time (OAT) parameter sensitivity results to avoid confusion with geographical locality.
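For readers unfamiliar with the OAT convention, the elasticity can be written as follows (notation here is mine, a sketch of the standard formulation rather than a quote from the PEcAn codebase):

$$e_i = \left.\frac{\partial g_i(\theta_i)}{\partial \theta_i}\right|_{\theta_i = \tilde{\theta}_i} \cdot \frac{\tilde{\theta}_i}{g_i(\tilde{\theta}_i)}$$

where $g_i$ is the spline fit to the model output as parameter $\theta_i$ is varied one at a time and $\tilde{\theta}_i$ is its posterior median; the normalization makes elasticities unitless and comparable across parameters.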
```r
#' Aggregate local sensitivity results across sites
#'
#' @param sensitivity_outdir Directory containing PEcAn sensitivity outputs
#' @param design_points Data frame of site metadata
```
I recommend documenting what columns are expected in this df, and whether unexpected columns are ignored or cause errors
```r
)

# Process all files
all_results <- purrr::map_dfr(sa_files, function(sa_file) {
```
This will be much easier to read and understand if you define it as a function with an informative name -- yes, even if it's only called once as `all_results <- purrr::map_dfr(sa_files, name_of_function)`
I have extracted the file-processing logic into a standalone helper function `process_sa_file`.
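For reference, a minimal sketch of what that refactor might look like; the file format, column names, and list structure are assumptions for illustration, not the PR's actual code:

```r
# Hypothetical helper: parse one per-site sensitivity-analysis output
# file into a tidy data frame (assumes one saved result list per site).
process_sa_file <- function(sa_file) {
  sa <- readRDS(sa_file)
  data.frame(
    site_id    = sa$site_id,
    parameter  = names(sa$elasticities),
    elasticity = unname(unlist(sa$elasticities))
  )
}

# The map call now reads as a sentence: one row-bound result per file.
all_results <- purrr::map_dfr(sa_files, process_sa_file)
```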
Thanks for the comments,
…rely as a consumer of the shared design_points.csv artifact
…t list extraction
```r
# Currently, this script consumes the 'design_points.csv' from the shared
# directory. At this stage of the project, these are MANUALLY SELECTED
# points (198 sites), not yet the output of the automated clustering
# workflow.
#
# TODO: Once the integration architecture between the Downscaling and
# Uncertainty repos is finalized, refactor this script to continue
# consuming the artifact generated by that pipeline.
# =======================================================================
```
dlebauer
left a comment
Overall, this is very nice work. The code is clear and easy to understand.
I've read through the scripts and R functions. My next step will be to review the report. I've made a few comments. Not all need to be done now, but please write tickets or todos to capture future work.
```yaml
pecan_xml_template: "data_raw/template.xml"
sites:
  design_points_file: "data_raw/sa_design_points.csv"
sensitivity:
```
The goal with this file was to handle settings that aren't in the pecan xml. Is there a reason that these are included here rather than the template.xml?
Minimizing config options here, and providing sensible defaults in the template.xml could make it more clear to end users.
good point: template.xml already holds most PEcAn-specific defaults. The items in 000-config.yml fall into two categories:

1. project-level paths (`ccmmf_dir`, `design_points`, `pecan_outdir`): these are system-specific and shouldn't live in template.xml since they can change per machine
2. run configs (`n_ensemble`, `n_met`, `start_date`, `end_date`): moved here from being hardcoded in 002_build_xml.R so users can configure them in one place; they feed into the XML-building step

I have minimized the config; let me know if you still have any specifics.
```yaml
@@ -8,8 +8,19 @@ default:
raw_data_dir: "data_raw"
cache_dir: "cache"
pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
```
Once we have ccmmf_dir, can we change this to use that as a variable? That way it is only necessary to change the system-specific path once. (also applies to master_design points etc. )
```diff
- pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
+ pecan_outdir: "$ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
```
I looked into this; unfortunately YAML doesn't support variable substitution like `$ccmmf_dir/modelout/...`.
If you'd prefer, we could switch to an R-based config.
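If we do go the R route, one minimal pattern (illustrative only; the directory names under `ccmmf_dir` are assumptions) is to keep only `ccmmf_dir` in the YAML and derive dependent paths after loading:

```r
cfg <- config::get(file = "000-config.yml")

# Emulate the "$ccmmf_dir/..." substitution that YAML itself cannot do:
# derive each machine-specific path from the single ccmmf_dir root.
cfg$paths$pecan_outdir <- file.path(
  cfg$paths$ccmmf_dir, "modelout", "ccmmf_phase_2b_mixed_pfts_20250701"
)
cfg$paths$design_points <- file.path(
  cfg$paths$ccmmf_dir, "data", "design_points.csv"
)
```

This keeps the YAML free of duplicated system-specific prefixes while leaving `config::get()` in place.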
```yaml
raw_data_dir: "data_raw"
cache_dir: "cache"
pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
master_design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
```

nit: these are just design points ...

```diff
- master_design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
+ design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
```
```r
cfg <- config::get(file = "000-config.yml")

# Define paths based on config
master_file <- cfg$paths$master_design_points
```
These may seem like trivial naming requests, but clarity and specificity are important.

- Why is this 'master_design_points'? That implies that there are other sets of design points.
- I would call this object `design_points_csv` instead of `master_file`; instead of `master_data`, I would call that object `design_points`.
```r
  )
) |>
  dplyr::filter(!is.na(pft)) |>
  dplyr::slice_sample(n = n_sample) |>
```
why are we sampling here?
this is just to sample a few sites for testing.
added a development/production mode
```r
# Prepares design points for Uncertainty Analysis.
#
# Currently, this script consumes the 'design_points.csv' from the shared
# directory. At this stage of the project, these are MANUALLY SELECTED
```
these were not manually selected, they were from the site selection
updated comment to reference clustering
```r
#!/usr/bin/env Rscript

# =======================================================================
# 001_setup_design_points.R
```
The fact that this script exists makes me wonder whether the clustering workflow that generates design_points.csv should be updated - that will help ensure consistency in workflows that consume it.
yes, the PFT mapping step (annual crop -> grass, woody perennial crop -> temperate.deciduous) should ideally be part of the clustering output; added a TODO to coordinate with downscaling
```r
raw_data_dir <- cfg$paths$raw_data_dir

# SA configuration (configurable paths)
options <- list(
```
it is unclear why configs are hard-coded here if they are already in the config file and/or template.xml?
moved `n_ens`, `n_met`, `start_date`, `end_date` to the `run:` section in 000-config.yml. The script now reads all configurable values from the config. Paths like `ic_dir` and `met_dir` are derived from `ccmmf_dir` in the script
```r
)
site_info <- site_info |> dplyr::rename(id = site_id)

# Read template settings
```
consider - what is the minimum amount of information that we want the end users to be able to configure, and why?
In addition to file paths, I can see n_sample, start_date, end_date being useful for being able to trigger 'development' mode.
current user-facing config:

- `flags.production`: true/false (controls site count)
- `sites.n_sample`: number of sites in dev mode
- `run.start_date` / `run.end_date`
- `run.n_ensemble` / `run.n_met`
- `paths.ccmmf_dir`

everything else is derived
```r
  dir.create(settings$outdir, recursive = TRUE)
}

# Handle workflow resumption
```
is it worth the effort and overhead to manage a STATUS file? My recollection is that the STATUS file was developed primarily for the PHP web interface to display. It seems that dropping the status file would simplify the code a lot. What benefit does it provide, if any?
This is a context where targets might be useful.
keeping it for now: the STATUS file enables the --continue flag for workflow resumption, which is valuable when running large site SA jobs; these checks let us skip completed steps on restart without re-running everything
agree that targets would be a cleaner approach for this. Added to the TODO list for future work
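A generic sketch of the skip-if-done pattern the STATUS file enables; the file format and helper name here are assumptions, and PEcAn's actual STATUS handling may differ:

```r
# Hypothetical: treat STATUS as a log with lines like "SENSITIVITY DONE".
status_done <- function(outdir, step) {
  status_file <- file.path(outdir, "STATUS")
  if (!file.exists(status_file)) {
    return(FALSE)
  }
  any(grepl(paste0("^", step, "\\b.*\\bDONE\\b"), readLines(status_file)))
}

# On restart with --continue, completed steps are skipped cheaply.
if (!status_done(settings$outdir, "SENSITIVITY")) {
  # ... run the sensitivity step, then append "SENSITIVITY DONE" ...
}
```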
dlebauer
left a comment
Overall the report is a nice analysis. A few changes could make the results easier to interpret.
- Read and implement "turning tables into graphs"
- Lead with a 3–5 bullet summary of the top findings (what parameters are most important to calibrate; how do these vary with environmental drivers?)
- Display all of the elasticity plots at once using faceting.
- Show variance across sites in the elasticity plot.
- Show sites on a map similar to the one showing the design points in the downscaling repo.
- Replace the site table with a rank-distribution plot (median + IQR) and link to a downloadable CSV.
- Make gradient results a small multiple of partial dependence plots for the top 3 parameters, rather than a long table.
> Parameter rankings by mean absolute elasticity identify which inputs most strongly influence model outputs, averaged across all environmental conditions. High elasticity indicates the model is highly sensitive to that parameter.
>
> ::: panel-tabset
It would be helpful to be able to view all of these plots at the same time, either as panels or different color bars.
By default, I prefer to show medians, especially for non-normally distributed values. If they exist, consider showing the distribution of values (e.g., across sites)
done, replaced the panel-tabset with a single faceted plot showing all response variables simultaneously. Points show median signed elasticity across sites; whiskers show the IQR
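A sketch of that faceted plot in ggplot2; the data frame and column names (`parameter_rankings`, `median_elasticity`, `q25_elasticity`, `q75_elasticity`, `response_var`) are assumed from this thread, not copied from the PR:

```r
library(ggplot2)

# One point per parameter: median signed elasticity across sites,
# with IQR whiskers, faceted by response variable.
ggplot(
  parameter_rankings,
  aes(x = median_elasticity, y = reorder(parameter, abs(median_elasticity)))
) +
  geom_errorbarh(aes(xmin = q25_elasticity, xmax = q75_elasticity), height = 0.3) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  facet_wrap(~ response_var) +
  labs(x = "Signed elasticity (median and IQR across sites)", y = NULL)
```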
> ------------------------------------------------------------------------
>
> # Site-by-Site Sensitivity {#sec-site-sensitivity}
The site-by-site table is difficult to interpret and doesn't add much once across-site variance is shown in the elasticity plot. Let's remove it. Andrew Gelman's "Let's practice what we preach: turning tables into graphs" makes a good case for visual summaries when tables get this dense: https://sites.stat.columbia.edu/gelman/research/published/dodhia.pdf
(The reason that I put a dense table of county-aggregated means in the downscaling report is that those values are the numbers that the client wants.)
A map would be a nice way to show the distribution of sites. It could eventually (not required for MVP) be interesting to map elasticity of the first, say, five parameters.
A few small usability notes for this style of table:
- when rendered as HTML, the table scrolls horizontally but the headers don't stay aligned.
- Two significant figures would likely be enough; additional precision is noise.
- The cell color bars are helpful for relative magnitude.
Consider something like a heat map. It isn't necessary to show every site; perhaps summarize by environmental cluster?
See Dietze et al 2017 https://agupubs.onlinelibrary.wiley.com/doi/10.1002/2013JG002392
replaced the dense DT table with:
- a CA site map (points colored by PFT)
- boxplots showing the cross-site distribution of elasticity for the top 10 parameters

The CSV data remains available for download.
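The boxplot half of that change might look roughly like this; the long-format columns `parameter` and `elasticity` in `aggregated_results` are assumptions, and the top-10 cutoff mirrors the reply:

```r
library(dplyr)
library(ggplot2)

# Rank parameters by mean absolute elasticity and keep the top 10.
top10 <- aggregated_results |>
  group_by(parameter) |>
  summarize(importance = mean(abs(elasticity)), .groups = "drop") |>
  slice_max(importance, n = 10)

# Boxplots show the cross-site distribution of signed elasticity.
aggregated_results |>
  filter(parameter %in% top10$parameter) |>
  ggplot(aes(x = elasticity, y = reorder(parameter, abs(elasticity), FUN = median))) +
  geom_boxplot() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Signed elasticity across sites", y = NULL)
```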
> # Technical Details {#sec-technical}
>
> ## Sensitivity Metrics
link to this section above, or perhaps it could be in the beginning.
added a cross-reference link to @sec-technical in the overview callout note where sensitivity metrics are first mentioned
> - **Negative slope**: Sensitivity *decreases* along gradient
> - **High** $R^2$: Strong environmental control on parameter importance
>
> ## Variance Explained Trends
it's not clear why there are only two unique values of $R^2$ in the table
the duplicate values occurred because the elasticity and variance_explained regression targets were producing identical results (same underlying data, same gradient variables). Restructured the gradient section to focus on signed elasticity only, which should eliminate the duplicates
> :::
>
> ## Summary Statistics
>
> Regression analysis tests whether parameter sensitivity systematically varies with environmental conditions, identifying context-dependent calibration needs.
>
> ## Regression Results
one way these could be displayed is using facets:
- each row a different parameter (for top params / greatest slopes)
- each column a different environmental covariate
- each plot has x = env covariate and y = elasticity. Could use color to plot multiple outputs (SOC, NPP, etc.) on the same plot, or create separate plots for each output.

Another note on plots: by default, keep the axes (and other scales) the same even across different variations of a plot (e.g. all elasticity panels should have the same range). This makes it easier to compare. Sometimes it is necessary to break this rule, e.g. to show interesting trends that would otherwise be obscured.
Implemented as faceted scatter plots. Each panel shows one parameter x gradient combination with an OLS fit and an $R^2$ annotation. Filtered to the top 8 relationships by $R^2$, and axes are kept consistent where possible
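A sketch of those facets; the data frame and columns (`gradient_data`, `env_value`, `gradient_var`, `r2_label`) are placeholders for illustration, not the report's actual variables:

```r
library(ggplot2)

# One panel per parameter x gradient combination, OLS fit overlaid,
# R^2 printed in the panel corner via a pre-computed label column.
ggplot(gradient_data, aes(x = env_value, y = elasticity)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  geom_text(
    aes(label = r2_label),
    x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5, check_overlap = TRUE
  ) +
  facet_wrap(~ interaction(parameter, gradient_var, sep = " x ")) +
  labs(x = "Environmental covariate", y = "Signed elasticity")
```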
```r
parameter_rankings <- aggregated_results |>
  dplyr::group_by(parameter, response_var) |>
  dplyr::summarize(
    # Mean absolute elasticity (primary metric for sensitivity)
```
taking the absolute value of elasticity makes sense when ranking parameter importance, but for plotting and regression we should use untransformed elasticity.
There are lots of ways to plot this information, but this one from LeBauer et al 2013 is shown here mostly to note that the direction of elasticity is also important information.
good point. Added signed (untransformed) elasticity summary stats to parameter_rankings:

- `mean_elasticity`, `median_elasticity` (signed means for plotting)
- `q25_elasticity`, `q75_elasticity` (for IQR whiskers)

absolute elasticity (`mean_abs_elasticity`) is retained for ranking and priority scoring; all plots now use signed elasticity
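Concretely, the summarize block could read as follows; this is a sketch consistent with the column names above, assuming a long-format `elasticity` column in `aggregated_results`:

```r
parameter_rankings <- aggregated_results |>
  dplyr::group_by(parameter, response_var) |>
  dplyr::summarize(
    # absolute elasticity: ranking and priority scoring only
    mean_abs_elasticity = mean(abs(elasticity), na.rm = TRUE),
    # signed summaries: plotting direction and spread
    mean_elasticity   = mean(elasticity, na.rm = TRUE),
    median_elasticity = median(elasticity, na.rm = TRUE),
    q25_elasticity    = stats::quantile(elasticity, 0.25, na.rm = TRUE),
    q75_elasticity    = stats::quantile(elasticity, 0.75, na.rm = TRUE),
    .groups = "drop"
  )
```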
Additional items for the "TODO" list
```r
#
# Reads the canonical design_points.csv produced by the downscaling
# clustering workflow (020_cluster_and_select_design_points.R).
# In development mode, subsets to n_sample sites for fast iteration.
```
these are not manually selected - these were generated by an earlier version of the downscaling workflow.
To ensure that we are using the correct version of an external dataset (in this case, design_points.csv), we should not check a copy into this repository.
We should have a shared understanding of where to write data and
```yaml
pecan_xml_template: "data_raw/template.xml"
sites:
  design_points_file: "data_raw/sa_design_points.csv"
  n_sample: 10
```
I think it would be clearer to put n_sample and any other values that will be used in dev mode into a separate 'development' chunk
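The `config` package's YAML inheritance supports exactly this split; a sketch (key names follow the file quoted above):

```yaml
default:
  sites:
    design_points_file: "data_raw/sa_design_points.csv"

development:
  sites:
    n_sample: 10  # dev-mode subset size; unused in the default chunk
```

Calling `config::get(config = "development")`, or setting the `R_CONFIG_ACTIVE` environment variable, activates the dev values while inheriting everything else from `default`.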
```r
  dplyr::mutate(
    pft = dplyr::case_when(
      pft == "annual crop" ~ "grass",
      pft == "woody perennial crop" ~ "temperate.deciduous",
      TRUE ~ NA_character_
    )
  ) |>
  dplyr::filter(!is.na(pft))

if (production) {
  # production: use all sites
  final_data <- design_points |>
    dplyr::select(site_id, lat, lon, pft)
  PEcAn.logger::logger.info(
    "Production mode: using all ", nrow(final_data), " sites"
  )
} else {
  # development: random subset for fast testing
  n_sample <- cfg$sites$n_sample %||% 10L
  set.seed(42)
  final_data <- design_points |>
    dplyr::slice_sample(n = min(n_sample, nrow(design_points))) |>
    dplyr::select(site_id, lat, lon, pft)
  PEcAn.logger::logger.info(
    "Development mode: sampled ", nrow(final_data), " of ",
    nrow(design_points), " sites"
  )
}

readr::write_csv(final_data, out_file)
```
This is a small point to improve clarity & maintainability:
- add the `select` to the initial changes to `design_points` (it only needs to be done once)
- then `if (production)` just sends a message, and the `else` (dev) branch only subsets and sends a message

Suggested change:

```r
  dplyr::select(site_id, lat, lon, pft) |>
  dplyr::mutate(
    pft = dplyr::case_when(
      pft == "annual crop" ~ "grass",
      pft == "woody perennial crop" ~ "temperate.deciduous",
      TRUE ~ NA_character_
    )
  ) |>
  dplyr::filter(!is.na(pft))

num_sites <- nrow(design_points)
if (production) { # use all sites
  PEcAn.logger::logger.info(
    "Production mode: using all ", num_sites, " sites"
  )
} else { # random subset for fast testing
  n_sample <- cfg$sites$n_sample %||% 10L
  set.seed(42)
  design_points <- design_points |>
    dplyr::slice_sample(n = min(n_sample, nrow(design_points)))
  PEcAn.logger::logger.info(
    "Development mode: sampled ", nrow(design_points), " of ",
    num_sites, " sites"
  )
}
readr::write_csv(design_points, out_file)
```
> ## Approach
>
> We use the PEcAn variance decomposition workflow, which proceeds in three steps:
Suggested change:

> We use the PEcAn variance decomposition workflow (LeBauer et al, 2013), which proceeds in three steps:
> 1. **Meta-analysis.** A hierarchical Bayesian meta-analysis synthesizes published trait measurements to produce posterior probability distributions for each model parameter. These distributions capture both the central estimate and remaining uncertainty.
>
> 2. **Sensitivity analysis.** Each parameter is perturbed one at a time to the quantile equivalents of $\pm 1\sigma$ and $\pm 2\sigma$ of its posterior distribution, and the model is re-run. The resulting response curves are fit with splines, and elasticity is computed as the normalized sensitivity:
Suggested change:

> 2. **Sensitivity analysis.** The model is run first with all parameters set to their posterior median values; then each parameter is perturbed one at a time to the quantile equivalents of $\pm 1\sigma$ and $\pm 2\sigma$ of its posterior distribution and the model is re-run. The resulting response curves are fit with splines, and elasticity is computed as the normalized sensitivity:

Summary:
Implements a complete multi-site local sensitivity analysis (SA) workflow for SIPNET as specified in #151. This PR delivers aggregated sensitivity results across design points, environmental gradient analysis, and a Quarto-based report summarizing parameter importance patterns.
Addresses Issue #151 Requirements